Monitoring Distributed Data Streams Through Node Clustering
PaperID: 1161
Authors: Maria Barouti, Daniel Keren, Jacob Kogan, and Yaakov Malinovsky
Keywords: Applications of Clustering, Mining text documents, Text Mining

Reviewer 1's opinion

Ranking Criteria (Name: Rank)
Appropriateness to the Conference: Weak Accept
Originality: Marginal
Technical Strength: Weak Accept
Presentation: Weak Reject
Overall Evaluation: Marginal

Comments:
In this paper, the authors present a strategy to decrease the communication overhead while monitoring data streams in distributed systems. In particular, they propose to apply a clustering algorithm to nodes in order to introduce an intermediate control step before communicating the violation of a threshold value to the central node. Contrary to classical clustering algorithms, the proposed algorithm has an objective function which aims to maximize the dissimilarity within the same cluster. Although the main idea seems to be promising, the paper has some issues:

- All the sections appear full of equations, which also refer to results obtained in other works. On the other hand, there is a lack of intuitive explanation of some of the presented concepts. For this reason, the whole paper appears difficult to follow.

- Section 2. The application in Text Mining does not appear well connected to the rest of the paper. Moreover, equation (4), which should represent a measure of the information gain of a feature, is not clear. In particular, it is not clear how the feature is represented in the formula, since the sum loops over the values x11, x12, x21 and x22, which are independent of the given feature.
-----------------------------------------------------------------------------------------
I do emphasize in a couple of places that x11, x12, x21 and x22 depend on the selected "feature."
-----------------------------------------------------------------------------------------

- Section 2. The meaning of the relevance label r is not clear.
-----------------------------------------------------------------------------------------
I "shift" the blame to the 2007 paper by Sharfman, Schuster, and Keren.
-----------------------------------------------------------------------------------------

- Section 3. The proposed node clustering approach appears motivated by the results reported in Table 1. However, the authors only state "The results immediately suggest to cluster nodes to further reduce communication load", without giving any explanation of the (possible) intuitions coming from such results.
-----------------------------------------------------------------------------------------
I did attempt to insert some intuition by considering clustering of the "longest" and "shortest" vectors.
-----------------------------------------------------------------------------------------

- Section 4. In Equation (8), the sum should loop over the values {1,...,k} instead of the values {1,...,n}.
-----------------------------------------------------------------------------------------
I changed n to k.
-----------------------------------------------------------------------------------------

Moreover, the reported algorithm to compute the history vectors hn(tj) is not clear. What do the authors mean by "for t increasing from tj to tj+1"? (It seems to be a single iteration.)
-----------------------------------------------------------------------------------------
I inserted a sentence to recall that tj stands for the time of the entire dataset mean update, t=tj+1,tj+2,...,t(j+1).
-----------------------------------------------------------------------------------------

Moreover, in the same algorithm, a factor of 1/2 appears without any comment about its effect. Why do the authors set this factor to 1/2 and not to another value?
-----------------------------------------------------------------------------------------
I put in a short remark that 1/2 is selected arbitrarily, and that a more sophisticated approach may lead to variable weights derived from the history of the process.
-----------------------------------------------------------------------------------------

- Section 4. The proposed clustering algorithm seems to work in a bottom-up agglomerative fashion. However, the large set of formulas makes such a simple approach appear much more complicated. The description should be highly simplified.
-----------------------------------------------------------------------------------------
The current description contains two steps and takes about 1/4 of a page. I do not see how it can be "highly simplified." I, therefore, do not touch it.
-----------------------------------------------------------------------------------------

- Section 5. The authors apply their clustering approach in a scenario with only 10 nodes. I think that the advantages of a clustering approach should be evaluated in a scenario with a much higher number of nodes.
-----------------------------------------------------------------------------------------
I take this as a suggestion for future work and/or a BSF proposal (exactly as the 1/2 weights statement).
-----------------------------------------------------------------------------------------

Moreover, the authors do not report the number of identified clusters in the performed experiments.
-----------------------------------------------------------------------------------------
The aim was to show that clustering may reduce communication, and the paper does this. I did not think that somebody might also be interested in the number of clusters. This information is available, and in the future we can include it.
I suggest avoiding work we consider nonessential right now: the submission deadline is looming, and there are many things that should be done at this time (I think that the "right" choice of \alpha is a much more interesting and important issue).
-----------------------------------------------------------------------------------------

Reviewer 2's opinion

Ranking Criteria (Name: Rank)
Appropriateness to the Conference: Strong Accept
Originality: Weak Accept
Technical Strength: Weak Accept
Presentation: Strong Accept
Overall Evaluation: Weak Accept

Comments:
The paper presents an alternative approach to monitoring data streams in a distributed system. This approach combines system theory techniques and clustering. The difference with respect to other clustering algorithms that look for similar data is that monitoring requires clusters with dissimilar vectors able to cancel each other as much as possible. I am not an expert on this topic, but the paper seems correct, it is clearly written, and the explanations of the methods appear to be clear and complete enough. My doubts are due to the fact that the number of experiments and the comparison with state of the art seem quite limited.
-----------------------------------------------------------------------------------------
The point about "the comparison with state of the art" is valid, but I suggest postponing this issue to the proposal preparation.
-----------------------------------------------------------------------------------------

Reviewer 3's opinion

Ranking Criteria (Name: Rank)
Appropriateness to the Conference: Weak Accept
Originality: Weak Accept
Technical Strength: Weak Accept
Presentation: Weak Accept
Overall Evaluation: Weak Accept

Comments:
The paper presents a new technique based on systems theory and clustering for distributed data stream monitoring. The paper is novel and technically correct. Simulation results could be strengthened by better presentation of the main benefits of the new approach.
A better validation of a new method by comparing it with a larger number of competing approaches would improve the paper significantly.
-----------------------------------------------------------------------------------------
See my comment above.
-----------------------------------------------------------------------------------------