Smart Cloud Document Clustering and Plagiarism Checker Using TF-IDF Based on Cosine Similarity

Sudhir Sahani, IMS Engineering College; Rajat Goyal, IMS Engineering College; Saurabh Sharma, IMS Engineering College; Shaili Gupta, IMS Engineering College

Algorithm, Cloud, Classification, Hierarchical, Clustering

This paper reports the results of an experimental study of the conventional document clustering techniques used in commercial settings so far. In particular, we compare the two main approaches to document clustering: agglomerative hierarchical clustering and K-means. Through this paper, we develop and implement checker algorithms that detect duplication between the content of a document and the rest of the documents in the cloud. We also develop an algorithm for classifying the cloud data. Classification in this algorithm is based on the date on which the data was uploaded and [...]. We take the ratio of both vectors to generate a score that rates the document within the classification.
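The duplication check named in the title can be illustrated with a minimal sketch, assuming each document is turned into a TF-IDF vector and compared with the stored documents by cosine similarity. The function names, the smoothed IDF formula, and the 0.8 threshold are assumptions made for illustration; they are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vector = {}
        for term, count in counts.items():
            tf = count / total
            # Smoothed IDF (an assumption): stays positive even for terms
            # that occur in every document.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(new_document, stored_documents, threshold=0.8):
    """Return (index, score) for stored documents similar to the new one.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    vectors = tfidf_vectors(list(stored_documents) + [new_document])
    new_vector = vectors[-1]
    matches = []
    for index, vector in enumerate(vectors[:-1]):
        score = cosine_similarity(new_vector, vector)
        if score >= threshold:
            matches.append((index, score))
    return matches
```

Calling `find_near_duplicates(uploaded_text, stored_texts)` would flag the stored documents whose score reaches the threshold; a real checker would likely add stemming and stop-word removal before weighting.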
Paper ID: GRDJEV02I050153
Published in: Volume: 2, Issue: 5
Publication Date: 2017-05-01
Page(s): 331-333