Re: Memory use with sorting problem

2007-11-22 Thread Chris Laux
Thanks for your reply. I made some memory saving changes, as per your
advice, but the problem remains.

 Set the max warming searchers to 1 to ensure that you never have more
 than one warming at the same time.

Done.

 How many documents are in your index?

Currently about 8 million.

 If you don't need range queries on these numeric fields, you might try
 switching from sfloat to float and from sint to int.  The
 fieldCache representation will be smaller.

As far as I can see slong etc. is also needed for sorting queries
(which I do, as mentioned). Anyway, I got an error message when I tried
sorting on a long field.

 Is it normal to need that much Memory for such a small index?
 
 Some things are more related to the number of unique terms or the
 number of documents than to the size of the index.

Is there a manageable way to find out / limit the number of unique terms
in Solr?

Cheers,

Chris




Re: Performance problems for OR-queries

2007-11-22 Thread Jörg Kiegeland



1. Does Solr support this kind of index access with better performance ?
Is there anything special to define in schema.xml?



No... Solr uses Lucene at its core, and all matching documents for a
query are scored.
  
So it is not possible to get Google-like performance with Solr, 
i.e. to search for a set of keywords and list only the 10 best documents, 
without touching the other millions of (web) documents matching 
fewer keywords.
In fact I would not know how to implement such an index myself, however Google has 
managed it somehow..



2. Can one switch off this ordering and just return any 100 documents
fulfilling the query (though getting the best-matching documents would be
a nice feature if it were fast)?



a feature like this could be developed... but what is the use case for
this?  What are you trying to accomplish where relevancy or
complete matching doesn't matter?  There may be an easier workaround
for your specific case.
  
This is not an actual use case for my project; I just wanted to 
know whether it would be possible.


Because of the performance results, we designed a new type of query. I 
would like to know how fast it would be before I implement the following 
query:


I have N keywords and execute a query of the form

keyword1 AND keyword2 AND .. AND keywordN

there may again be millions of matching documents and I want to get the 
first 100 documents.
To have an ordering criterion, each Solr document has a field named REV which 
holds a natural number. The returned 100 documents shall be those with
the lowest values in the REV field.

My questions now are:

(1) Will the query perform in O(100) or in O(all possible matches)?

(2) If the answer to (1) is O(all possible matches), what will the performance be if I 
don't order by the REV field? Will Solr then order by the time 
a document was created/modified? What do I have to do to get O(100) complexity?
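For reference, the sorted-by-REV query the previous paragraph describes could be built like this (a sketch only; host, core path, and the field names `id`/`REV` are taken from the example above, not from any real install):

```python
from urllib.parse import urlencode

# Hypothetical keywords; in practice these come from the user's search.
keywords = ["keyword1", "keyword2", "keyword3"]

params = {
    "q": " AND ".join(keywords),  # keyword1 AND keyword2 AND keyword3
    "sort": "REV asc",            # lowest REV values first
    "rows": 100,                  # return only the first 100 matches
    "fl": "id,REV",               # fields to return
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

Whether Solr can answer this in O(100) is exactly the question posed above; the URL itself only expresses the ranking, it does not change the matching cost.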


Thanks

Jörg




Re: Document update based on ID

2007-11-22 Thread Jörg Kiegeland



Yes, SOLR-139 will eventually do what you need.

The most recent patch should not be *too* hard to get running (it may 
not apply cleanly though)  The patch as is needs to be reworked before 
it will go into trunk.  I hope this will happen in the next month or so.


As for production?  It depends ;)  The API will most likely change so 
if you base your code on the current patch, it will need to change 
when things finalize.  As for stability, it has worked well for me 
(and I think for Erik)


A useful feature would be update based on query, so that documents 
matching the query condition will all be modified in the same way on the 
given update fields.

Will this feature also be available in the future?



Re: Strange behavior MoreLikeThis Feature

2007-11-22 Thread Ryan McKinley


Now when I run the following query:
http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on



try adding:
 debugQuery=on

to your query string and you can see why each document matches...

My guess is that features uses a text field with stemming and a 
stemmed word matches


ryan


Re: Document update based on ID

2007-11-22 Thread Ryan McKinley

Jörg Kiegeland wrote:



Yes, SOLR-139 will eventually do what you need.

The most recent patch should not be *too* hard to get running (it may 
not apply cleanly though)  The patch as is needs to be reworked before 
it will go into trunk.  I hope this will happen in the next month or so.


As for production?  It depends ;)  The API will most likely change so 
if you base your code on the current patch, it will need to change 
when things finalize.  As for stability, it has worked well for me 
(and I think for Erik)


A useful feature would be update based on query, so that documents 
matching the query condition will all be modified in the same way on the 
given update fields.

Will this feature also be available in the future?



interesting, I had not thought of that - but it could be useful. 
(Potentially dangerous and resource intensive, but so is rm)


Can you add a comment to SOLR-139 with this idea?  Once SOLR-139 is more 
stable, it would make sense to do this as a new issue.


ryan



Grouping multiValued fields

2007-11-22 Thread Mark Baird
Let's say I have a class Item that has a collection of Sell objects.
Sell objects have two properties sellingTime (Date) and salesPerson
(String).
So in my Solr schema I have something like the following fields defined:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="sellingTime" type="date" indexed="true" stored="false"
multiValued="true" />
<field name="salesPerson" type="text" indexed="true" stored="false"
multiValued="true" />

An add might look like the following:

<add>
  <doc>
    <field name="id">1</field>
    <field name="sellingTime">2007-11-23T23:01:00Z</field>
    <field name="salesPerson">John Doe</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="sellingTime">2007-12-24T01:15:00Z</field>
    <field name="salesPerson">John Doe</field>
    <field name="sellingTime">2007-11-23T21:11:00Z</field>
    <field name="salesPerson">Jack Smith</field>
  </doc>
</add>


My problem is that all the historical sales data for the items are
getting flattened out.
I need the sellingTime and salesPerson fields to be kept as a pair
somehow, but I also need to store the date in a separate field so that
I can do range searches.

Specifically I want to be able to do the following search:
salesPerson:"John Doe" AND sellingTime:[2007-11-23T00:00:00Z TO
2007-11-24T00:00:00Z]
Right now that query would return both items 1 and 2, but I want it to
only return item 1.

Is there some trick to get this query to work as I want it to?  Or do
I need to totally restructure my data?
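One possible restructuring (an idea sketched here, not a built-in Solr grouping feature) is to denormalize: index one Solr document per sale, carrying the item id, so each sellingTime/salesPerson pair stays intact. The `itemId` field name and the `id` values are hypothetical:

```xml
<add>
  <doc>
    <field name="id">2-sale-1</field>
    <field name="itemId">2</field>
    <field name="sellingTime">2007-12-24T01:15:00Z</field>
    <field name="salesPerson">John Doe</field>
  </doc>
  <doc>
    <field name="id">2-sale-2</field>
    <field name="itemId">2</field>
    <field name="sellingTime">2007-11-23T21:11:00Z</field>
    <field name="salesPerson">Jack Smith</field>
  </doc>
</add>
```

With this layout the pair query salesPerson:"John Doe" AND sellingTime:[...] can only match a sale where both conditions hold on the same sale, and the itemId field tells you which item it belongs to.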


Heritrix and Solr

2007-11-22 Thread George Everitt
I'm looking for a web crawler to use with Solr.  The objective is to  
crawl about a dozen public web sites regarding a specific topic.


After a lot of googling, I came across Heritrix, which seems to be the
most robust, well-supported open source crawler out there.  Heritrix
has an integration with Nutch (NutchWax), but not with Solr.  I'm
wondering if anybody can share any experience using Heritrix with Solr.


It seems that there are three options for integration:

1. Write a custom Heritrix Writer class which submits documents to  
Solr for indexing.
2. Write an ARC to Solr input XML format converter to import the ARC
files.
3. Use the filesystem mirror writer and then another program to walk  
the downloaded files.


Has anybody looked into this or have any suggestions on an alternative
approach?  The optimal answer would be "You dummy, just use XXX to
crawl your web sites - there's no 'integration' required at all.  Can
you believe the temerity?  What a poltroon."


Yours in Revolution,
George










Re: Heritrix and Solr

2007-11-22 Thread A. Banji Oyebisi




I am interested in this too. Any ideas?





A. Banji Oyebisi
Choicegen, LLC.
Email: [EMAIL PROTECTED]
Web URL: http://www.choicegen.com
Choicegen... Helping you make better choices!
 

 Notice:  This email message, together with any attachments, may contain information  of  Choicegen,  LLC.,  its subsidiaries  and  affiliated
 entities,  that may be confidential,  proprietary,  copyrighted  and/or legally privileged, and is intended solely for the use of the individual
 or entity named in this message. If you are not the intended recipient, and have received this message in error, please immediately return this
 by email and then delete it.

 
 




George Everitt wrote:
I'm looking for a web crawler to use with Solr. The
objective is to crawl about a dozen public web sites regarding a
specific topic.
  
  
After a lot of googling, I came across Heritrix, which seems to be the
most robust well supported open source crawler out there. Heritrix
has an integration with Nutch (NutchWax), but not with Solr. I'm
wondering if anybody can share any experience using Heritrix with Solr.
  
  
It seems that there are three options for integration:
  
  
1. Write a custom Heritrix "Writer" class which submits documents to
Solr for indexing.
  
2. Write an ARC to Solr input XML format converter to import the ARC
files.
  
3. Use the filesystem mirror writer and then another program to walk
the downloaded files.
  
  
Has anybody looked into this or have any suggestions on an alternative
approach? The optimal answer would be "You dummy, just use XXX to
crawl your web sites - there's no 'integration' required at all. Can
you believe the temerity? What a poltroon."
  
  
Yours in Revolution,
  
George






Re: Heritrix and Solr

2007-11-22 Thread Cool Coder
I have a similar requirement, where I need to move to a good crawler. 
Currently I am using a custom crawler (my own) to crawl some 
public domains, and I use Lucene to index all downloaded pages. After doing lots 
of research I came across JSpider with Lucene.
I also looked at Nutch for the crawling job, but I don't think that is 
feasible.
   
  - BR

A. Banji Oyebisi [EMAIL PROTECTED] wrote:
  I am interested in this too. any ideas?

A. Banji Oyebisi  Choicegen, LLC.  Email: [EMAIL PROTECTED]  Web URL: 
http://www.choicegen.com  Choicegen... Helping you make better choices!





George Everitt wrote:   I'm looking for a web crawler to use with Solr.  The 
objective is to crawl about a dozen public web sites regarding a specific 
topic. 

After a lot of googling, I came across Heritrix, which seems to be the most 
robust well supported open source crawler out there.   Heritrix has an 
integration with Nutch (NutchWax), but not with Solr.   I'm wondering if 
anybody can share any experience using Heritrix with Solr. 

It seems that there are three options for integration: 

1. Write a custom Heritrix Writer class which submits documents to Solr for 
indexing. 
2. Write an ARC to Solr input XML format converter to import the ARC files. 
3. Use the filesystem mirror writer and then another program to walk the 
downloaded files. 

Has anybody looked into this or have any suggestions on an alternative 
approach?  The optimal answer would be You dummy, just use XXX to crawl your 
web sites - there's no 'integration' required at all.   Can you believe the 
temerity?   What a poltroon. 

Yours in Revolution, 
George 













   

Re: Performance problems for OR-queries

2007-11-22 Thread Mike Klaas

On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote:



1. Does Solr support this kind of index access with better  
performance ?

Is there anything special to define in schema.xml?



No... Solr uses Lucene at its core, and all matching documents for a
query are scored.

So it is not possible to get Google-like performance with 
Solr, i.e. to search for a set of keywords and list only the 10 best 
documents, without touching the other millions of (web) 
documents matching fewer keywords.
In fact I would not know how to implement such an index myself, however 
Google has managed it somehow..


I can be fairly certain that Google does not execute queries that  
match millions of documents on a single machine.  The default query  
operator is (mostly) AND, so the possible match set is much  
smaller.  Also, I imagine they have relatively few documents per  
machine.


2. Can one switch off this ordering and just return any 100  
documents
fullfilling the query (though  getting best-matching documents  
would be

a nice feature if it would be fast)?



a feature like this could be developed... but what is the use case for
this?  What are you trying to accomplish where relevancy or
complete matching doesn't matter?  There may be an easier workaround
for your specific case.

This is not an actual use case for my project; I just  
wanted to know whether it would be possible.


Because of the performance results, we designed a new type of  
query. I would like to know how fast it would be before I implement  
the following query:


I have N keywords and execute a query of the form

keyword1 AND keyword2 AND .. AND keywordN

there may again be millions of matching documents and I want  
to get the first 100 documents.
To have an ordering criterion, each Solr document has a field named  
REV which holds a natural number. The returned 100 documents shall  
be those with

the lowest values in the REV field.

My questions now are:

(1) Will the query perform in O(100) or in O(all possible matches)?


O(all possible matches)

(2) If the answer to (1) is O(all possible matches), what will  
the performance be if I don't order by the REV field? Will Solr  
then order by the time a document was created/modified? 
What do I have to do to get O(100) complexity?


Ordering by natural document order in the index is sufficient to  
achieve O(100), but you'll have to insert code in Solr to stop after  
100 docs (another alternative is to stop processing after a given  
amount of time).  Also, using O() in this case isn't quite accurate:  
there are costs that vary with the number of docs in the index too.
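A toy sketch (plain Python, not Solr internals) of the cost difference Mike describes: picking the top 100 by REV must examine every match, while returning any 100 in index order can stop after the first 100. The synthetic `matches` data is of course an assumption:

```python
import heapq
import itertools

# Hypothetical matches: (doc_id, rev) pairs in index (docid) order.
matches = [(doc_id, (doc_id * 7919) % 100000) for doc_id in range(1_000_000)]

# Top 100 by REV: every match must be examined -> O(all matches).
top_by_rev = heapq.nsmallest(100, matches, key=lambda m: m[1])

# Unsorted "any 100": stop as soon as 100 are collected -> touches 100 docs.
first_100 = list(itertools.islice(matches, 100))

print(len(top_by_rev), len(first_100))  # prints: 100 100
```

In Solr itself the second behaviour would need the early-termination hook Mike mentions; out of the box, all matches are scored.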


-Mike

Re: Heritrix and Solr

2007-11-22 Thread Norberto Meijome
On Thu, 22 Nov 2007 10:41:41 -0500
George Everitt [EMAIL PROTECTED] wrote:

 After a lot of googling, I came across Heritrix, which seems to be the  
 most robust well supported open source crawler out there.   Heritrix  
 has an integration with Nutch (NutchWax), but not with Solr.   I'm  
 wondering if anybody can share any experience using Heritrix with Solr.

out on a limb here... both Nutch and SOLR use Lucene for the actual indexing / 
searching. Would the indexes generated with Nutch be compatible / readable with 
SOLR? 

_
{Beto|Norberto|Numard} Meijome

Why do you sit there looking like an envelope without any address on it?
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Any tips for indexing large amounts of data?

2007-11-22 Thread Otis Gospodnetic
Brendan - yes, 64-bit Linux this is, and the JVM got 5.5 GB heap, though it 
could have worked with less.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Brendan Grainger [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 21, 2007 1:24:05 PM
Subject: Re: Any tips for indexing large amounts of data?

Hi Otis,

Thanks for this. Are you using a flavor of linux and is it 64bit? How  
much heap are you giving your jvm?

Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

 Mike is right about the occasional slow-down, which appears as a  
 pause and is due to large Lucene index segment merging.  This  
 should go away with newer versions of Lucene where this is  
 happening in the background.

 That said, we just indexed about 20MM documents on a single 8-core  
 machine with 8 GB of RAM, resulting in nearly 20 GB index.  The  
 whole process took a little less than 10 hours - that's over 550  
 docs/second.  The vanilla approach before some of our changes  
 apparently required several days to index the same amount of data.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Mike Klaas [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 19, 2007 5:50:19 PM
 Subject: Re: Any tips for indexing large amounts of data?

 There should be some slowdown in larger indices as occasionally large
 segment merge operations must occur.  However, this shouldn't really
 affect overall speed too much.

 You haven't really given us enough data to tell you anything useful.
 I would recommend trying to do the indexing via a webapp to eliminate
 all your code as a possible factor.  Then, look for signs to what is
 happening when indexing slows.  For instance, is Solr high in cpu, is
 the computer thrashing, etc?

 -Mike

 On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

 Hi,

 Thanks for answering this question a while back. I have made some
 of the suggestions you mentioned. ie not committing until I've
 finished indexing. What I am seeing though, is as the index get
 larger (around 1Gb), indexing is taking a lot longer. In fact it
 slows down to a crawl. Have you got any pointers as to what I might
 be doing wrong?

 Also, I was looking at using MultiCore solr. Could this help in
 some way?

 Thank you
 Brendan

 On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


 : I would think you would see better performance by allowing auto
 commit
 : to handle the commit size instead of reopening the connection
 all the
 : time.

 if your goal is fast indexing, don't use autoCommit at all ...
  just
 index everything, and don't commit until you are completely done.

 autoCommitting will slow your indexing down (the benefit being
 that more
 results will be visible to searchers as you proceed)




 -Hoss












Re: Heritrix and Solr

2007-11-22 Thread Norberto Meijome
On Thu, 22 Nov 2007 19:10:46 -0800 (PST)
Otis Gospodnetic [EMAIL PROTECTED] wrote:

 The answer to that question, Norberto, would depend on versions.

Otis, would that relate to the underlying version of Lucene being used in 
Solr and Nutch respectively? 

_
{Beto|Norberto|Numard} Meijome

Web2.0 is outsourced R&D from Web1.0 companies.
  The Reverend

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


C++ type of analysis issues

2007-11-22 Thread Yu-Hui Jin
Hi, there,

I haven't found any existing filter/tokenizer that can deal with "C++"-type
search keywords.  I'm using the WordDelimiterFilter, which removes
the "++".

One way I am thinking of right now is to use a synonym filter before the
WordDelimiterFilter to replace "c++" (after lower-casing it) with, say,
"cpp", and to use the synonym filter for both indexing and querying.
That would cause a "cpp" string to be found as a result of searching
"c++" (or "C++"). But I guess this is not a big problem.

Anyway, I feel this is a common issue that must have been solved by someone
already - does anyone have a better solution?
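For reference, the approach described above might look like this in the schema (a sketch using standard Solr factory names; the exact attributes are assumptions to verify against your version), with synonyms.txt containing the single line `c++ => cpp`. The synonym filter runs after lower-casing and before the WordDelimiterFilter, so "c++" is rewritten before the "++" is stripped:

```xml
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>
```

The same chain must be used at both index and query time, as the poster notes, or "c++" queries will not meet the indexed "cpp" tokens.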


Thanks,

-Hui


Re: Document update based on ID

2007-11-22 Thread Walter Underwood
This can be useful, but it is limited. At Infoseek, we used this
for demoting porn and spam in the index in 1996, but replaced it
with more precise approaches.

wunder

On 11/22/07 6:49 AM, Ryan McKinley [EMAIL PROTECTED] wrote:

 Jörg Kiegeland wrote:
 
 Yes, SOLR-139 will eventually do what you need.
 
 The most recent patch should not be *too* hard to get running (it may
 not apply cleanly though)  The patch as is needs to be reworked before
 it will go into trunk.  I hope this will happen in the next month or so.
 
 As for production?  It depends ;)  The API will most likely change so
 if you base your code on the current patch, it will need to change
 when things finalize.  As for stability, it has worked well for me
 (and I think for Erik)
 
 A useful feature would be update based on query, so that documents
 matching the query condition will all be modified in the same way on the
 given update fields.
 Will this feature also be available in the future?
 
 
 interesting, I had not thought of that - but it could be useful.
 (Potentially dangerous and resource intensive, but so is rm)
 
 Can you add a comment to SOLR-139 with this idea?  Once SOLR-139 is more
 stable, it would make sense to do this as a new issue.
 
 ryan
 



Re: Heritrix and Solr

2007-11-22 Thread Otis Gospodnetic
The answer to that question, Norberto, would depend on versions.

George: why not just use straight Nutch and forget about Heritrix?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, November 22, 2007 5:54:32 PM
Subject: Re: Heritrix and Solr

On Thu, 22 Nov 2007 10:41:41 -0500
George Everitt [EMAIL PROTECTED] wrote:

 After a lot of googling, I came across Heritrix, which seems to be
 the  
 most robust well supported open source crawler out there.   Heritrix
  
 has an integration with Nutch (NutchWax), but not with Solr.   I'm  
 wondering if anybody can share any experience using Heritrix with
 Solr.

out on a limb here... both Nutch and SOLR use Lucene for the actual
 indexing / searching. Would the indexes generated with Nutch be compatible
 / readable with SOLR? 

_
{Beto|Norberto|Numard} Meijome

Why do you sit there looking like an envelope without any address on
 it?
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when
 wet. Reading disclaimers makes you go blind. Writing them is worse.
 You have been Warned.





Re: Heritrix and Solr

2007-11-22 Thread Otis Gospodnetic
Hi George,
Thank you for your kind words about Lucene in Action. :)

I wouldn't compare Solr and Nutch; they are really made for different things.  
I was suggesting Nutch instead of Heritrix, not instead of Solr.  The 
Solr+Nutch patch is in JIRA and there is a fresh patch in there, still warm - 
try it out.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: George Everitt [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 22, 2007 10:58:08 PM
Subject: Re: Heritrix and Solr

Otis:

There are many reasons I prefer Solr to Nutch:

1. I actually tried to do some of the crawling with Nutch, but found  
the crawling options less flexible than I would have liked.
2. I prefer the Solr approach in general.  I have a long background in
  
Verity and Autonomy search, and Solr is a bit closer to them than
 Nutch.
3. I really like the schema support in Solr.
4. I really really like the facets/parametric search in Solr.
5. I really really really like the REST interface in Solr.
6. Finally, and not to put too fine a point on it, hadoop frightens  
the bejeebers out of me.  I've skimmed some of the papers and it looks
  
like a lot of study before I will fully understand it.  I'm not saying
  
I'm stupid and lazy, but if the map-reduce algorithm fits, I'll wear  
it.  Plus, I'm trying to get a mental handle on Jeff Hawkins' HTM and  
its application to the real world.   It all makes my cerebral cortex  
itchy.

Thanks for the suggestion, though.   I'll probably revisit Nutch again
  
if Heritrix lets me down.  I had no luck getting the Nutch crawler  
Solr patch to work, either.   Sadly, I'm the David Lee Roth of Java  
programmers - I may think that I'm hard-core, but I'm not, really. And  
my groupies are getting a bit saggy.

BTW - add my voice to the paeans of praise for Lucene in Action.   You  
and Erik did a bang-up job, and I surely appreciate all the feedback  
you give on this forum, especially over the past few months as I feel  
my way through Solr and Lucene.



On Nov 22, 2007, at 10:10 PM, Otis Gospodnetic wrote:

 The answer to that question, Norberto, would depend on versions.

 George: why not just use straight Nutch and forget about Heritrix?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Norberto Meijome [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Cc: [EMAIL PROTECTED]
 Sent: Thursday, November 22, 2007 5:54:32 PM
 Subject: Re: Heritrix and Solr

 On Thu, 22 Nov 2007 10:41:41 -0500
 George Everitt [EMAIL PROTECTED] wrote:

 After a lot of googling, I came across Heritrix, which seems to be
 the
 most robust well supported open source crawler out there.   Heritrix

 has an integration with Nutch (NutchWax), but not with Solr.   I'm
 wondering if anybody can share any experience using Heritrix with
 Solr.

 out on a limb here... both Nutch and SOLR use Lucene for the actual
 indexing / searching. Would the indexes generated with Nutch be  
 compatible
 / readable with SOLR?

 _
 {Beto|Norberto|Numard} Meijome

 Why do you sit there looking like an envelope without any address on
 it?
  Mark Twain

 I speak for myself, not my employer. Contents may be hot. Slippery  
 when
 wet. Reading disclaimers makes you go blind. Writing them is worse.
 You have been Warned.










Re: Heritrix and Solr

2007-11-22 Thread George Everitt

Otis:

There are many reasons I prefer Solr to Nutch:

1. I actually tried to do some of the crawling with Nutch, but found  
the crawling options less flexible than I would have liked.
2. I prefer the Solr approach in general.  I have a long background in  
Verity and Autonomy search, and Solr is a bit closer to them than Nutch.

3. I really like the schema support in Solr.
4. I really really like the facets/parametric search in Solr.
5. I really really really like the REST interface in Solr.
6. Finally, and not to put too fine a point on it, hadoop frightens  
the bejeebers out of me.  I've skimmed some of the papers and it looks  
like a lot of study before I will fully understand it.  I'm not saying  
I'm stupid and lazy, but if the map-reduce algorithm fits, I'll wear  
it.  Plus, I'm trying to get a mental handle on Jeff Hawkins' HTM and  
its application to the real world.   It all makes my cerebral cortex  
itchy.


Thanks for the suggestion, though.   I'll probably revisit Nutch again  
if Heritrix lets me down.  I had no luck getting the Nutch crawler  
Solr patch to work, either.   Sadly, I'm the David Lee Roth of Java  
programmers - I may think that I'm hard-core, but I'm not, really. And  
my groupies are getting a bit saggy.


BTW - add my voice to the paeans of praise for Lucene in Action.   You  
and Erik did a bang-up job, and I surely appreciate all the feedback  
you give on this forum, especially over the past few months as I feel  
my way through Solr and Lucene.




On Nov 22, 2007, at 10:10 PM, Otis Gospodnetic wrote:


The answer to that question, Norberto, would depend on versions.

George: why not just use straight Nutch and forget about Heritrix?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Norberto Meijome [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, November 22, 2007 5:54:32 PM
Subject: Re: Heritrix and Solr

On Thu, 22 Nov 2007 10:41:41 -0500
George Everitt [EMAIL PROTECTED] wrote:


After a lot of googling, I came across Heritrix, which seems to be

the

most robust well supported open source crawler out there.   Heritrix



has an integration with Nutch (NutchWax), but not with Solr.   I'm
wondering if anybody can share any experience using Heritrix with

Solr.

out on a limb here... both Nutch and SOLR use Lucene for the actual
indexing / searching. Would the indexes generated with Nutch be  
compatible

/ readable with SOLR?

_
{Beto|Norberto|Numard} Meijome

Why do you sit there looking like an envelope without any address on
it?
 Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery  
when

wet. Reading disclaimers makes you go blind. Writing them is worse.
You have been Warned.








Re: Memory use with sorting problem

2007-11-22 Thread Otis Gospodnetic
I'd have to check, but the Luke handler might spit that out.  If not, Lucene's 
TermEnum & co. are your friends. :)
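For example, the Luke request handler can report the top terms per field; a sketch of such a request (the host, field name, and parameter values are assumptions - check the Luke handler docs for your Solr version):

```
http://localhost:8983/solr/admin/luke?fl=myfield&numTerms=50
```

The response includes per-field term statistics, which gives a rough view of how many unique terms a sort on that field would pull into the FieldCache.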

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Chris Laux [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, November 22, 2007 7:22:56 AM
Subject: Re: Memory use with sorting problem

Thanks for your reply. I made some memory saving changes, as per your
advice, but the problem remains.

 Set the max warming searchers to 1 to ensure that you never have more
 than one warming at the same time.

Done.

 How many documents are in your index?

Currently about 8 million.

 If you don't need range queries on these numeric fields, you might
 try
 switching from sfloat to float and from sint to int.  The
 fieldCache representation will be smaller.

As far as I can see slong etc. is also needed for sorting queries
(which I do, as mentioned). Anyway, I got an error message when I tried
sorting on a long field.

 Is it normal to need that much Memory for such a small index?
 
 Some things are more related to the number of unique terms or the
 number of documents than to the size of the index.

Is there a manageable way to find out / limit the number of unique
 terms
in Solr?

Cheers,

Chris







Re: Strange behavior MoreLikeThis Feature

2007-11-22 Thread Rishabh Joshi
Thanks Ryan. I now know the reason why.
Before I explain the reason, let me correct a mistake I made in my earlier
mail: I was not using the first document mentioned in the XML. Instead it
was this one:
<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>

The reason I was getting strange results was the character "i".
Here is what I learnt from the debug info:

"debug":{
  "rawquerystring":"id:neardup06",
  "querystring":"id:neardup06",
  "parsedquery":"features:og features:en features:til features:er
features:af features:der features:ts features:se features:i features:p
features:pet features:brag features:efter features:zombier features:k
features:tilbag features:ala features:sviner features:folk
features:klassisk features:resid features:horder features:lidt
features:man features:denn",
  "parsedquery_toString":"features:og features:en features:til
features:er features:af features:der features:ts features:se
features:i features:p features:pet features:brag features:efter
features:zombier features:k features:tilbag features:ala
features:sviner features:folk features:klassisk features:resid
features:horder features:lidt features:man features:denn",
  "explain":{
    "id=IW-02,internal_docid=8":"\n0.0050230525 = (MATCH) product of:\n
0.12557632 = (MATCH) sum of:\n    0.12557632 = (MATCH)
weight(features:i in 8), product of:\n      0.17474915 =
queryWeight(features:i), product of:\n        1.9162908 =
idf(docFreq=3)\n        0.09119135 = queryNorm\n      0.71860904 =
(MATCH) fieldWeight(features:i in 8), product of:\n        1.0 =
tf(termFreq(features:i)=1)\n        1.9162908 = idf(docFreq=3)\n
0.375 = fieldNorm(field=features, doc=8)\n  0.04 = coord(1/25)\n"}}}

The "features" field uses the default field type "text" in the schema.xml.
The problem was solved by adding the character "i" to the
stopwords.txt file: the "i"s in document 2 were matched against the "i" in
"iPod" of document 1.

I still have to figure out why the single character "i" matched the "i" in
a word - "iPod".
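A plausible explanation, consistent with Ryan's earlier guess: the default "text" type's WordDelimiterFilter splits tokens on case changes, so "iPod" yields a standalone "i" token at index time. A very rough Python sketch of that splitting behaviour (the real filter has many more rules and options - this is an illustration, not the Solr implementation):

```python
import re

def split_on_case_change(token):
    """Split a token into runs the way a case-change/digit split would,
    e.g. 'iPod' -> ['i', 'Pod']. Rough approximation only."""
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", token)

print(split_on_case_change("iPod"))  # prints: ['i', 'Pod']
print(split_on_case_change("USB"))   # prints: ['USB']
```

After lower-casing, that standalone "i" becomes an indexed term of its own, which is why the MLT query's features:i clause could match document 1.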

Regards,
Rishabh

On 22/11/2007, Ryan McKinley [EMAIL PROTECTED] wrote:

 
  Now when I run the following query:
 
 http://localhost:8080/solr/mlt?q=id:neardup06&mlt.fl=features&mlt.mindf=1&mlt.mintf=1&mlt.displayTerms=details&wt=json&indent=on
 

 try adding:
   debugQuery=on

 to your query string and you can see why each document matches...

 My guess is that features uses a text field with stemming and a
 stemmed word matches

 ryan