Re: SegmentFilter

2010-02-21 Thread Andrzej Bialecki
On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 22:45, reinhard schwab wrote: The content of one page is stored as many as 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 I believe this comes from Recno:: 383 URL::

Re: SegmentFilter

2010-02-21 Thread reinhard schwab
Andrzej Bialecki wrote: On 2010-02-21 12:36, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 23:32, reinhard schwab wrote: Andrzej Bialecki wrote: On 2010-02-20 22:45, reinhard schwab wrote: The content of one page is stored as many as 7 times.

Re: SegmentFilter

2010-02-20 Thread reinhard schwab
The content of one page is stored as many as 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 I believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Content:: Version: -1 url:

Re: SegmentFilter

2010-02-20 Thread Andrzej Bialecki
On 2010-02-20 22:45, reinhard schwab wrote: The content of one page is stored as many as 7 times. http://www.cinema-paradiso.at/gallery2/main.php?g2_page=8 I believe this comes from Recno:: 383 URL:: http://www.cinema-paradiso.at/gallery2/main.php?g2_highlightId=54519 Duplicate content is usually

Re: SegmentFilter

2010-02-19 Thread reinhard schwab
I am now implementing this tool by forking SegmentMerger. I have only added an additional filter in the map method, and I keep the segment name. I was then surprised that the reduce method logs the content of a crawl datum 4 times. Why is this? I then logged the content objects and they seem to be
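
The approach described above (filter in the map step, keep the segment name, then look at what reduce receives) could be sketched roughly as follows. This is a plain-Python illustration, not Nutch or Hadoop API code: the record layout `(url, segment_name, value)` and the names `map_phase`, `reduce_phase` and `keep_url` are all hypothetical. It only shows the general MapReduce behaviour that a reducer receives every value emitted under the same key, so logging inside reduce prints once per value; it does not by itself explain the four log entries reported in the message.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (url, segment_name, value).
Record = Tuple[str, str, str]


def map_phase(
    records: Iterable[Record], keep_url: Callable[[str], bool]
) -> Iterator[Tuple[str, Tuple[str, str]]]:
    """Emit only records whose URL passes the filter, tagged with their segment name."""
    for url, segment, value in records:
        if keep_url(url):
            # The segment name travels with the value, so it survives the shuffle.
            yield url, (segment, value)


def reduce_phase(
    mapped: Iterable[Tuple[str, Tuple[str, str]]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group mapped pairs by URL, as a shuffle followed by reduce would."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for url, tagged_value in mapped:
        grouped[url].append(tagged_value)
    return dict(grouped)
```

If several records share one URL, `reduce_phase` returns all of them under that key, which is why a per-value log statement in a reducer can fire more than once for a single URL.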

SegmentFilter

2010-02-14 Thread reinhard schwab
I would like to have a segment filter that filters out unneeded content. I only want to keep the content of pages that are still indexed in Solr and that belong to this segment when I query Solr by this segment name. Is there an existing tool available? SegmentMerger is a no-go for me. It
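
The selection rule in this request can be illustrated with a small sketch. It is plain Python under stated assumptions, not an existing Nutch tool: the Solr query result is stood in for by an in-memory list of documents, and the field names `url` and `segment` as well as the functions `urls_to_keep` and `filter_segment` are hypothetical.

```python
from typing import Dict, Iterable, Mapping, Set


def urls_to_keep(indexed_docs: Iterable[Mapping[str, str]], segment_name: str) -> Set[str]:
    """Collect URLs of indexed documents that are attributed to the given segment."""
    return {
        doc["url"]
        for doc in indexed_docs
        if doc.get("segment") == segment_name
    }


def filter_segment(segment_content: Mapping[str, str], keep: Set[str]) -> Dict[str, str]:
    """Keep only the content entries whose URL is in the keep set."""
    return {url: body for url, body in segment_content.items() if url in keep}
```

A page that is in the segment but no longer in the index, or that is indexed under a different segment name, would be dropped by `filter_segment`.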