Association rule mining algorithm based on Spark for pesticide transaction data analyses

Xiaoning Bai; Jingdun Jia; Qiwen Wei; Shuaiqi Huang; Weicheng Du; Wanlin Gao

Authors

Xiaoning Bai 1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; 2. Institute for the Control of Agrochemicals, Ministry of Agriculture and Rural Affairs, Beijing 100125, China; 3. Ministry of Science and Technology Torch Center, Beijing 100045, China
Jingdun Jia 1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; 3. Ministry of Science and Technology Torch Center, Beijing 100045, China
Qiwen Wei 4. National Agricultural Technology Promotion Center, Beijing 100125, China
Shuaiqi Huang 1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
Weicheng Du 5. Information Center, Ministry of Agriculture and Rural Affairs, Beijing 100125, China
Wanlin Gao 1. College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China

Keywords:

Spark, association rule mining, ICAMA algorithm, big data, pesticide regulation, MapReduce

Abstract

With the development of smart agriculture, the data accumulated in the field of pesticide regulation has reached a considerable scale. The pesticide transaction data collected by the Pesticide National Data Center alone amounts to more than 10 million records daily. However, because the technical means in use are outdated, the existing pesticide supervision data have not been mined or used in depth. The Apriori algorithm is one of the classic algorithms in association rule mining, but it has to traverse the transaction database multiple times, which causes an extra I/O burden. Spark is an emerging big data parallel computing framework with advantages such as in-memory computing and resilient distributed datasets. Compared with the Hadoop MapReduce computing framework, its I/O performance is greatly improved. Therefore, this paper proposes ICAMA, an improved Apriori algorithm based on the Spark framework. A MapReduce process is used first to count the support of the candidate sets and then to generate the candidate sets. In the experimental comparison, when the data volume exceeded 250 Mb, the performance of the Spark-based Apriori algorithm was 20% higher than that of the traditional Hadoop-based Apriori algorithm, and the improvement became more pronounced as the data volume increased.

DOI: 10.25165/j.ijabe.20191205.4881

Citation: Bai X N, Jia J D, Wei Q W, Huang S Q, Du W C, Gao W L. Association rule mining algorithm based on Spark for pesticide transaction data analyses. Int J Agric & Biol Eng, 2019; 12(5): 162–166.
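The level-wise procedure summarized in the abstract (count the support of each candidate set with a map and a reduce step, then generate the next candidate sets) can be illustrated with a minimal sketch. This is not the authors' ICAMA implementation, whose details the abstract does not give; it is a generic Apriori in plain Python, with lists of transactions standing in for Spark partitions. The function names, the sample transactions, and the threshold `min_support = 2` are all hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_partition(partition, candidates):
    """Map step: count, within one partition, the transactions containing each candidate."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def count_support(partitions, candidates):
    """Reduce step: merge the per-partition counts into global support counts."""
    partials = (map_partition(p, candidates) for p in partitions)
    return reduce(lambda a, b: a + b, partials, Counter())


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets; prune any with an infrequent subset."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(partitions, min_support):
    """Return {itemset: support count} for every itemset seen in >= min_support transactions."""
    items = {item for p in partitions for t in p for item in t}
    candidates = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while candidates:
        counts = count_support(partitions, candidates)  # one pass over the data per level
        frequent = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in frequent})
        k += 1
        candidates = generate_candidates(frequent, k)
    return result


# Hypothetical pesticide transactions, split into two partitions.
partitions = [
    [frozenset({"glyphosate", "imidacloprid"}),
     frozenset({"glyphosate", "imidacloprid", "mancozeb"})],
    [frozenset({"glyphosate", "mancozeb"}),
     frozenset({"imidacloprid"})],
]
frequent = apriori(partitions, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each iteration of the `while` loop scans every partition once, which is the repeated traversal that the abstract identifies as the I/O cost of Apriori. Keeping the partitions in memory between iterations, as Spark does, is what avoids re-reading the data from disk at every level.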


License

IJABE is an international peer-reviewed open access journal that adopts Creative Commons copyright notices.