B659 Web Mining Project's Journal
[Most Recent Entries]
[Calendar View]
[Friends]
Below are the 13 most recent journal entries recorded in
B659 Web Mining Project's LiveJournal:
| Friday, April 21st, 2006 | 4:07 pm [dyrecorn]
 |
Project Discussion elwoodthegreat and I met today to discuss the current state of the project and future directions. Points we discussed: For LJ parsing, each entry has its own URL - we can parse these out. The way I divided the pages, crawling the LJs are more like doing a search on how consistently on-topic the entire page is, rather than a history-based approach. Still, this is interesting. I suggested that in addition to a re-ranking, we also present a link off of each entry in the list that links to the WBM's copy of the version of the page that provided the highest acceleration. In this way, we're providing something closer to Google's original ranking - a list of the most highly-rated pages. The realization that prompted this was our awareness that the current project has drifted a little in scope. What we're presenting is not so much a competitor to Google's ranking. The reason for this is that the acceleration model can rank a page highly if it's been very on-topic in the past, even if the current version isn't. THis means that a user visiting the current version looking for immediate payoff may be confused by our ranking. What we're actually finding is something more akin to "which pages should I come back to / bookmark for consistent information on topic X?" The pages we rank highly have a high probability of presenting new information on a given topic over time. Thus, for our project, a user study would have to be done over a period of time, such as 6 months, comparing Google's current top rankings for a query to our top ranking over the same time period. This is one of the reasons I then proposed we keep the most highly-ranked copies, because that allows us to present something mroe akin to what Google is - an instant payoff of the desired information. | 1:56 pm [dyrecorn]
 |
| | Wednesday, April 19th, 2006 | 12:38 pm [elwoodthegreat] |
| | Tuesday, April 4th, 2006 | 7:56 pm [elwoodthegreat] |
I felt like rewriting the program to make it more modular for the future addition of crawlers for different archives than just the Wayback Machine. While I was at it, I decided to take a shot at implementing the acceleration model. Here is the body of the accelerate() method: double dragConstant = 0.2; public void accelerate(double value, double mass, double time){ if(mass==0){ return; }else{ velocity += (((value - (dragConstant * velocity))/mass) * (time/100000000)); } if(velocity < 0){ velocity = 0; } } Here value (or Force) is the number of matches found on an archived page scaled by its dissimilarity to the previous page, mass is the number of shingles the page has, and time is the amount of time between this archived page and the previous page (in milliseconds). I experimented briefly with different dragConstants on the comparison query Google OR FUD between slashdot.org and boingboing.net The following graphs show the outcome at dragConstant = 0.01, 0.1, and 0.2.    I think 0.2 worked fairly well in this case and might even think about increasing it, but I haven't tested it on queries that produce mediocre or results worse than these two pages. More graphs and results to come soon. Please feel free to give us any queries you want tested. | | Monday, March 27th, 2006 | 1:44 pm [elwoodthegreat] |
Lots of updates to the program: From just before break: I fixed that problem we were having with StringNode's not being what I thought they were. I did it by changing the website into a all lowercase string of viewable text (StringBean) and then counting occurrences of keywords or quoted strings by using regular expressions. How I handled the boolean expressions is as follows: keyword1 AND keyword2 = min(count(keyword1),count(keyword2)) keyword1 OR keyword2 = count(keyword1) + count(keyword2) keyword1 NOT keyword2 = if(count(keyword2)>0) return 0 else count(keyword2) Nesting of boolean expressions is done the same as above. This seemed like the best definitions for these expressions in my opinion. With this and the shingling, I performed the query: Bloomington Indiana public (Transportation OR Transit) and came up with decent results except that some pages were static and had many references versus www.bloomingtontransit.com which only had a few per archived page. So this looks really good considering that this should all be resolved when the acceleration/velocity model gets added. If we want to do direct comparisons between Google's ranking and this program, we will have to transform the query syntax to be more Google-like at runtime before submitting it to the Google. From today: Fixed small but significant bug with matching system implemented above. This partly fixed the problem mentioned above, but the acceleration model should hopefully rectify it completely. Added a comparison mode so the user can enter the urls to be queried. For example: The query "FUD" on http://slashdot.org and http://boingboing.net resulted in the following ranking: #1: http://boingboing.net with a score of 0.09062081019654906 #2: http://slashdot.com with a score of 0.035449470969946446 I also reduced the memory overhead a little so processing the above example is possible. Other things to work on include: -Implement the acceleration model -Making the number of Google results to rank user specified -Converting the query to Google syntax before executing it -Adding a nifty interface | | Monday, March 6th, 2006 | 10:51 am [elwoodthegreat] |
Updated Prototype
Last week, I updated the prototype: 1) Refactored the code so it will be much easier to extend and modify 2) Added sampling normalization so only the sparsest sampling rate will be used. In doing this I measured the sample density as: # of entries / (Last entry - First entry) I thought this might be better than: # of entries / (Current time - First entry) due to the fact that a page might have stopped existing in the archive. Although the chances that we have to deal with a page like this is dependent on the recency of Google's index. Next things to do are integrate the acceleration model and scaling based upon page change. | | Wednesday, February 22nd, 2006 | 5:34 pm [dyrecorn]
 |
Future work
Fil notes we could also use this to do page analysis to reveal the most important concepts in a given site. I note we could then use vectors of these terms to do community building in a group of sites. TF*IDF*TP (term preservation)? Will use term frequency for the normal project. | 5:22 pm [dyrecorn]
 |
Spoke with Fil today: He points out the the sampling bias causes us to have normalization problems, which we should address before we look into adding other metrics. He suggests that we only take, say 10 images for each page. But what if a page has existed for 10 years vs one that has only existed for 1 year. He suggests possibly only chosing 1 update per month. Normalize based on a time unit, say 1 per 3 months or 1 per 3 months. Minimum sampling (with a max) based on the sparsest one. We can probably use shingling for duplicate detection. Ideas to score a page higher based on the amount of change as well as the frequency of term occurence. | | Friday, February 17th, 2006 | 5:05 pm [dyrecorn]
 |
Meeting Update elwoodthegreat and I met today to discuss the project. After a code review, we came to some conclusions. I need to review and add onto the scoring code and the "acceleration" code that updates a page's score. We decided that we might encounter a problem if a given page has many updates stored in the WBM that are generally cosmetic - i.e. last modified changes and such. This would give it a high score in our algorithm because those terms would remain in the page over a larger period of time. We decided that we should talk to Fil about ways to store and detect duplicates, and only score a page if it deviated by a certain amount from previous versions. (We might only have to store the previous version for this) Also, should we score a page based on the amount by which it deviates from a previous version? Some of the math in the page scoring seems possibly off since we're getting a lot of 0s, and it seems like we shouldn't. We decided that hitting the WBM with 10 threads at a time is not that bad, so we're leaving it as is so that it runs faster. If we ever need to run multiple copies we might change it to sequential so we're not bombarding the WBM. | | Monday, February 13th, 2006 | 10:04 am [elwoodthegreat] |
Clustering Idea
While reading a part of Chapter 4, I was considering the following approach: Find the centroid of the TFIDF space of all pages in the archive for a given result page Rank based upon distance or cosine similarity between query and result centroids This would not favor those that "changed and mentioned the query" or account for recency at all, but it would be another interesting measure that seems somewhat intuitive. | 9:47 am [elwoodthegreat] |
| | Tuesday, January 31st, 2006 | 10:44 am [dyrecorn]
 |
B659 - Project Proposal
While many way of ranking results from a user query have been explored, most of these focus on some heuristic that only considers the most recent version of the page as seen by a crawler. Due to the impossible task of visiting every page in the Web even once, less effort has been made on doing any temporal analysis of web pages, as this requires a history of the page that must be built by re-visiting the page over a span of time. For small subsets of the Web this is feasible, but for the scale required by a major crawler, less so. In our project we propose to exploit an already existing archival resource to enable us to attempt some temporal measures. The resource we will be using is the Wayback Machine Internet Archive. Founded in 1996 to act as an Internet Library, the Wayback Machine archives multiple versions of many popular pages over a span of years. Interfaces are available for querying this resource, and we expect we will be able to make use of these. Our project will start with a user submitted query string. This query string will be sent to Google using the Google API, and we will receive the top 10 results from Google. We will then, for each resultant URL, use the Wayback Machine to apply a heuristic to measuring the URL’s value over time. Finally, we will re-rank the top 10 by this new score and compare the differences between our ranking and Google’s. Time permitting, we will measure several different heuristics, however, the general model we propose to use is an inertial model. In general, this works by treating a URL’s score as a velocity starting at 0 and subject to decay, and positive measures of a particular version of that URL at a particular time are treated as acceleration. So, for example, if we have 10 versions of a URL over time, the URL’s score will start at 0. If the first 3 pages are good, the URL will experience rapid “acceleration” and achieve a high “velocity”. However, if the next 7 versions are bad, no further acceleration will be applied and so the velocity will decay down over time. The final velocity will be the URL’s score. The way in which we actually measure the quality of a given version of a given page will also be another place where we might try several different measures. Most likely, we will initially try a naïve search in which we simply do keyword matching based on frequency and possibly proximity. Assuming we can complete this on schedule, we can move to more complicated measures, although it is unlikely that we will be able to use measures such as PageRank due to the large demand this would place on the Wayback Machine’s servers. Our results for this project, while unlikely to include a user study, will still illustrate whether our general concept provides a different and potentially useful way of looking at webpage ranking. If this turns out to be the case, expanding the work may provide an opportunity for future research. | | Monday, January 30th, 2006 | 3:32 pm [dyrecorn]
 |
|
|