International Journal of Technology Enhancements and Emerging Engineering Research (ISSN 2347-4289)

IJTEEE >> Volume 3 - Issue 7, July 2015 Edition


Website: http://www.ijteee.org


Framework For ETL With Hadoop Map Reduce




Jaswender Malik, Kavita



Keywords: ETL, Handler, Usage, Conclusion



Abstract: Every organization that serves a large number of users has to deal with Big Data. Efficiently fetching, transferring, storing, cleaning, sanitizing, querying and extracting information from Big Data is a daunting task, because a single machine running traditional algorithms cannot process such a volume of data tractably. Moreover, not all data arrives in a form that automated programs can process directly. Before the data is fed into large-scale data processing systems [1], the raw data must be converted into a consistent format. This is done using data cleaning, sanitization and transformation operations. In this paper we present a framework for data cleaning and transformation operations that can be integrated into existing Map Reduce (Hadoop) infrastructures. This framework can be standardized and adopted by corporations for their Big Data processing tasks.
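As a rough illustration of the kind of cleaning and transformation step described above, the following is a minimal sketch of a map-side cleaning function in Python. It is not the framework presented in the paper: the function names `clean_record` and `map_clean`, the comma-separated input layout, the fixed field count and the tab-separated output are all assumptions made for this example.

```python
from typing import Iterable, Iterator, Optional


def clean_record(line: str, expected_fields: int = 3) -> Optional[str]:
    """Normalize one raw comma-separated record.

    Returns a tab-separated record, or None if the record is unusable.
    The record layout is hypothetical and chosen only for illustration.
    """
    # Sanitize: trim surrounding whitespace from the line and from each field.
    fields = [field.strip() for field in line.strip().split(",")]
    # Clean: reject records with the wrong field count or with empty fields.
    if len(fields) != expected_fields or any(field == "" for field in fields):
        return None
    # Transform: lower-case the first field so keys are consistent.
    fields[0] = fields[0].lower()
    return "\t".join(fields)


def map_clean(lines: Iterable[str]) -> Iterator[str]:
    """Map phase: emit only the records that survive cleaning."""
    for line in lines:
        cleaned = clean_record(line)
        if cleaned is not None:
            yield cleaned
```

Because each record is handled independently, a step of this shape could in principle run as the map phase of a job, with every mapper cleaning its own input split before any downstream processing.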



[1] Dean, J. and S. Ghemawat (2008). MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), 107–113. ISSN 0001-0782. URL http://doi.acm.org/10.1145/1327452.1327492.

[2] Agrawal, D., S. Das, and A. El Abbadi, Big data and cloud computing: current state and future opportunities. In Proceedings of the 14th International Conference on Extending Database Technology, EDBT/ICDT ’11. ACM, New York, NY, USA, 2011. ISBN 978-1-4503-0528-0. URL http://doi.acm.org/10.1145/1951365.1951432.

[3] Jacobs, A. (2009). The pathologies of big data. Commun. ACM, 52(8), 36–44. ISSN 0001-0782. URL http://doi.acm.org/10.1145/1536616.1536632.

[4] El Akkaoui, Z., E. Zimányi, J.-N. Mazón, and J. Trujillo, A model-driven framework for ETL process development. In Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP, DOLAP ’11. ACM, New York, NY, USA, 2011. ISBN 978-1-4503-0963-9. URL http://doi.acm.org/10.1145/2064676.2064685.

[5] Shvachko, K., H. Kuang, S. Radia, and R. Chansler, The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10. IEEE Computer Society, Washington, DC, USA, 2010. ISBN 978-1-4244-7152-2. URL http://dx.doi.org/10.1109/MSST.2010.5496972.

[6] Hecht, R. and S. Jablonski, NoSQL evaluation: A use case oriented survey. In Proceedings of the 2011 International Conference on Cloud and Service Computing, CSC ’11. IEEE Computer Society, Washington, DC, USA, 2011. ISBN 978-1-4577-1635-5. URL http://dx.doi.org/10.1109/CSC.2011.6138544.

[7] Chaudhuri, S. and U. Dayal (1997). An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1), 65–74. ISSN 0163-5808. URL http://doi.acm.org/10.1145/248603.248616.

[8] http://en.wikipedia.org/wiki/Extract,_transform,_load