ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Hi all, I want to ask about the best way to implement a solution for indexing a large number of pdf documents of 10-60 MB each, with 100 to 1000 users connected simultaneously. I currently have 1 core of solr 3.3.0 and it works fine for a small number of pdf docs but I'm afraid about the

Re: xpath expression not working

2011-08-13 Thread abhayd
thanks Karsten i was able to use ur suggestion

Re: sorting issue with solr 3.3

2011-08-13 Thread Bernd Fehling
The issue was located in a 31 million docs index and I have already reduced it to a reproducible 4-document index. It is stock solr 3.3.0. Yes, the documents are also in the wrong order as the field sort values. I added only the field sort values to the email to keep it short. I will produce a

Re: paging size in SOLR

2011-08-13 Thread Erick Erickson
Jame: You control the number via settings in solrconfig.xml, so it's up to you. Jonathan: Hmmm, that seems right. After all, the deep paging penalty is really about keeping a large sorted array in memory, but at least you only pay it once per 10,000, rather than 100 times (assuming page size

Re: Example Solr Config on EC2

2011-08-13 Thread Erick Erickson
Do keep in mind though that your index will have to have any documents that have not yet been replicated added to the promoted slave. The easy way to do this is just to re-index documents from a safe point. If you're using time-based deltas, this is just some time interval far enough in the past to

Re: Indexing tweet and searching @keyword OR #keyword

2011-08-13 Thread Erick Erickson
I don't see an easy way to do that with the standard set of filters. You'll probably need to write something custom (note, this is actually pretty easy). I suspect you'll need to do something like Synonyms, where when you get a token like #ipod, you essentially make it a synonym for ipod and
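One reading of the synonym idea above, as a minimal sketch in plain Python rather than a real Solr TokenFilter: a token such as #ipod is kept, and its bare form is emitted at the same position so that either spelling matches. The function name and the (position, term) representation are assumptions made for illustration only.

```python
def expand_social_tokens(tokens):
    """Return (position, term) pairs.

    Tokens starting with '#' or '@' are kept as-is and also emitted
    without the prefix at the same position, synonym-style.
    """
    expanded = []
    for position, token in enumerate(tokens):
        expanded.append((position, token))
        if len(token) > 1 and token[0] in "#@":
            # Same position as the original token, like a synonym.
            expanded.append((position, token[1:]))
    return expanded


expanded = expand_social_tokens(["new", "#ipod", "from", "@apple"])
print(expanded)
```

Keeping both forms means a search for #ipod can still be told apart from a plain ipod search if the query side does not apply the same expansion.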

Re: Sorting suggest results on specific field

2011-08-13 Thread Erick Erickson
You can't, sorting only works with indexed data and only really makes sense for fields that have a single value. Sometimes using KeywordTokenizer helps if your field has more than one word, perhaps with copyField. Best Erick On Thu, Aug 11, 2011 at 3:34 AM, Anshum ansh...@gmail.com wrote:

Re: NRT in Master- Slave setup, crazy?

2011-08-13 Thread Erick Erickson
Hmmm, it almost seems like you're better off turning off replication entirely. Your master becomes a machine used as a source for rapidly spinning up a new slave or resetting a slave. I have no hard data to back up my misgivings about committing to the slaves then having replication overwrite

Re: Some questions about SolrJ

2011-08-13 Thread Michael Sokolov
On 8/12/2011 4:18 PM, Shawn Heisey wrote: On 8/12/2011 1:49 PM, Shawn Heisey wrote: I am sure that I have more questions, but I may be able to answer a lot of them myself if I can see better examples. Thought of another question. My Perl build system uses DIH for all indexing, but with the

Re: strip html from data

2011-08-13 Thread Erick Erickson
Right, this is expected behavior, it trips a lot of people up. When you specify ' indexed=true ' in your field definitions, the contents of the input stream are put into the inverted index etc, *after* all the transformations you specify via tokenizers, filters, charFilters, etc are applied. In
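A rough illustration of the indexed-versus-stored distinction described above, in plain Python and not Solr code: the analysis chain only affects the terms that go into the inverted index, while the stored value remains the raw input. The `analyze` chain here (strip tags, lowercase, split on whitespace) is a hypothetical stand-in for whatever charFilters, tokenizers, and filters a schema defines.

```python
import re

TAG = re.compile(r"<[^>]+>")


def analyze(raw):
    """Hypothetical analysis chain: strip tags, lowercase, split on whitespace."""
    return TAG.sub(" ", raw).lower().split()


def add_document(raw):
    """Model one field that is both indexed and stored."""
    return {
        "stored": raw,                  # returned verbatim in search results
        "indexed_terms": analyze(raw),  # what queries actually match against
    }


doc = add_document("<p>Hello <b>Solr</b></p>")
print(doc["stored"])
print(doc["indexed_terms"])
```

Under this model, getting clean text back in results means cleaning the input before it is sent, since the stored copy never passes through the analysis chain.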

Re: unique terms and multi-valued fields

2011-08-13 Thread Erick Erickson
Here's a very useful page for looking at what index size means. http://lucene.apache.org/java/3_0_2/fileformats.html#file-names Note that the files having to do with stored data (e.g. *.fdt) have very little impact on searching, they don't consume very many valuable resources. The

Re: Fuzzy search with sort combination - drawback

2011-08-13 Thread Erick Erickson
I'm puzzled by what this means: Is there a way to achieve the customized sort as well as the relevant content on top in this scenario. You say you remove the sorting part, which means your results are returned by relevance calculations. So I'm guessing that a debugQuery=on would show you that

Re: Not update on duplicate key

2011-08-13 Thread Erick Erickson
If you mean just throw the new document on the floor if the index already contains a document with that key, I don't think you can do that. You could write a custom updateHandler that checks first to see whether the particular uniqueKey is in the index I suppose... Best Erick On Fri, Aug 12,
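The check-first logic suggested above can be sketched as follows. This is plain Python over an in-memory dict, not the Solr update handler API; the class name and the "id" key field are assumptions made for illustration.

```python
class SkipDuplicatesIndex:
    """Toy index that ignores a document whose unique key already exists."""

    def __init__(self, unique_key="id"):
        self.unique_key = unique_key
        self._docs = {}

    def add(self, doc):
        """Add doc and return True, or return False if the key is taken."""
        key = doc[self.unique_key]
        if key in self._docs:
            return False  # drop the new document, keep the existing one
        self._docs[key] = doc
        return True

    def get(self, key):
        return self._docs.get(key)


index = SkipDuplicatesIndex()
first = index.add({"id": "42", "title": "original"})
second = index.add({"id": "42", "title": "replacement"})
print(first, second, index.get("42")["title"])
```

A real implementation would also have to consider documents that were added but not yet committed, which a lookup against the searchable index alone might not see.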

Re: Post content to be indexed to Solr

2011-08-13 Thread Erick Erickson
I don't think this is really do-able. The only thing that comes to my mind is that you could (and this is assuming you're using Tika to handle the file eventually) send the document through Tika on the client and construct a SolrJ document on the parts you care about. This would give you

Re: Some questions about SolrJ

2011-08-13 Thread Michael Sokolov
Shawn, my experience with SolrJ in that configuration (no autoCommit) is that you have control over commits: if you don't issue an explicit commit, it won't happen. Re lifecycle: we don't use a static instance; rather our app maintains a small pool of CommonsHttpSolrServer instances that we

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Erick Erickson
The problem I've always had is that I don't quite know what sorting on multivalued fields means. If your field had tokens a and z, would sorting on that field put the doc at the beginning or end of the list? Sure, you can define rules (first token, last token, average of all tokens
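To make the ambiguity concrete, here is a small sketch in plain Python (not Solr behavior) of how different rules for picking a sort value from a multivalued field change the result order. The strategy names and the sample documents are made up for illustration.

```python
# Each strategy reduces a list of field values to one sort key.
STRATEGIES = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "min": min,
    "max": max,
}


def sort_docs(docs, field, strategy):
    """Sort docs by a single value picked from a multivalued field."""
    pick = STRATEGIES[strategy]
    return sorted(docs, key=lambda doc: pick(doc[field]))


docs = [
    {"id": 1, "tags": ["a", "z"]},
    {"id": 2, "tags": ["m"]},
]

by_min = [doc["id"] for doc in sort_docs(docs, "tags", "min")]
by_max = [doc["id"] for doc in sort_docs(docs, "tags", "max")]
print(by_min, by_max)
```

The document with tokens a and z sorts first under "min" and last under "max", which is exactly why a default rule is hard to choose.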

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing. How are all these docs being submitted? Is this

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 18:03 Erick Erickson wrote: The problem I've always had is that I don't quite know what sorting on multivalued fields means. If your field had tokens a and z, would sorting on that field put the doc at the beginning or end of the list? Sure, you can define rules (first

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Hi Erick, Our app inserts the pdfs from a backoffice site and people can search/consult through a front end site. Both are written in php. I've installed a tomcat exclusively for solr. The pdf docs are indexed and not stored using the standard solr.extraction.ExtractingRequestHandler

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Martijn v Groningen
The first solution would make sense to me. Some kind of a strategy mechanism for this would allow anyone to define their own rules. Duplicating results would be confusing to me. On 13 August 2011 18:39, Michael Lackhoff mich...@lackhoff.de wrote: On 13.08.2011 18:03 Erick Erickson wrote: The

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Bill Bell
I have a different use case. Consider a spatial multivalued field with latlong values for addresses. I would want sort by geodist() to return the closest distance in each group. For example, find me the closest restaurant, with each doc being a chain name like pizza hut. Or doctors with multiple
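The "closest distance in each group" rule can be sketched in plain Python as below. This is not how geodist() is implemented in Solr; it only shows the intended ordering, using a haversine great-circle distance and taking the minimum over each document's locations. The document layout and chain names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def sort_by_closest_location(docs, origin):
    """Order docs by the distance from origin to their nearest location."""
    return sorted(
        docs,
        key=lambda doc: min(haversine_km(origin, point) for point in doc["locations"]),
    )


docs = [
    {"name": "chain_b", "locations": [(0.0, 2.0), (5.0, 5.0)]},
    {"name": "chain_a", "locations": [(10.0, 10.0), (0.0, 1.0)]},
]
ranked = sort_by_closest_location(docs, origin=(0.0, 0.0))
print([doc["name"] for doc in ranked])
```

Here chain_a ranks first because one of its locations is nearer to the origin than any location of chain_b, even though its other location is farther away.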

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 20:31 Martijn v Groningen wrote: The first solution would make sense to me. Some kind of a strategy mechanism for this would allow anyone to define their own rules. Duplicating results would be confusing to me. That is why I would only activate it on request (setting a special

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send PDF for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you process with Tika to text just send the update to Solr. Bill Bell Sent from mobile On Aug 13, 2011, at 10:13 AM, Erick Erickson erickerick...@gmail.com wrote:

Re: Problem with xinclude in solrconfig.xml

2011-08-13 Thread Bill Bell
What was it? Bill Bell Sent from mobile On Aug 10, 2011, at 2:21 PM, Way Cool way1.wayc...@gmail.com wrote: Sorry for the spam. I just figured it out. Thanks. On Wed, Aug 10, 2011 at 2:17 PM, Way Cool way1.wayc...@gmail.com wrote: Hi, Guys, Based on the document below, I should be

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Erick Erickson
Fair enough, but what's the first value in the list? There's nothing special about multiValued fields, that is where the schema has multiValued=true. Under the covers, this is no different than just concatenating all the values together and putting them in at one go, except for some games with the

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Ahhh, ok, my reply was irrelevant G... Here's a good write-up on this problem: http://www.lucidimagination.com/content/scaling-lucene-and-solr But Solr handles millions of documents on a single server in many cases, so waiting until the search app falls over is actually feasible. In general, if

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I will study the master/slave architecture for many slaves. In the future perhaps we will need it =) Best regards, Rode. -Original Message- From: Erick Erickson erickerick...@gmail.com To:

Re: SOLR 3.3.0 multivalued field sort problem

2011-08-13 Thread Michael Lackhoff
On 13.08.2011 21:28 Erick Erickson wrote: Fair enough, but what's the first value in the list? There's nothing special about multiValued fields, that is where the schema has multiValued=true. Under the covers, this is no different than just concatenating all the values together and putting them

Re: updating existing data in index vs inserting new data in index

2011-08-13 Thread Alexandre Sompheng
Hi Mark, I guess the commit=true when doing a delta-import is the solution for the JIRA issue I just submitted, SOLR-2711. Can you explain to me where you configured this commit=true? thanks, Alex On Thu, Jul 7, 2011 at 6:44 PM, Mark juszczec mark.juszc...@gmail.com wrote: First thanks for all the

Re: updating existing data in index vs inserting new data in index

2011-08-13 Thread Alexandre Sompheng
Actually I requested .../dataimport?command=delta-import&commit=true And DIH in delta-import mode does not commit. Do you have any guess ??? INFO: Starting Delta Import Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport
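Query parameters in such a request are separated by an ampersand, and building the URL programmatically avoids losing it. A minimal sketch using only the Python standard library; the host and port are hypothetical, while the webapp name and handler path are taken from the log line above.

```python
from urllib.parse import urlencode

# Hypothetical host and port; adjust to the actual deployment.
base = "http://localhost:8080/apache-solr-3.3.0/dataimport"
params = {"command": "delta-import", "commit": "true"}

url = base + "?" + urlencode(params)
print(url)
```

When calling the URL from a shell, it should be quoted so that the ampersand is not interpreted as a background operator.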

exceeded limit of maxWarmingSearchers ERROR

2011-08-13 Thread Naveen Gupta
Hi, Most of the settings are default. We have a single node (Memory 1 GB, Index Size 4GB). We have a requirement where we are doing very fast commits. This is a kind of real-time requirement where we are polling many threads from a third party and indexing into our system. We want these results to be

Date Facet Question

2011-08-13 Thread Jamie Johnson
When doing Date faceting I've noticed that if the query is something like: start: NOW-1YEAR end: NOW GAP: +1MONTH when the response comes back the facet names are 2010-08-14T01:50:58.813Z 2010-09-14T01:50:58.813Z 2010-10-14T01:50:58.813Z 2010-11-14T01:50:58.813Z 2010-12-14T01:50:58.813Z etc
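The facet names above carry the time of day of NOW because each bucket start is simply the start value plus a whole number of gaps. A minimal sketch in plain Python that reproduces that pattern; it is not Solr's date math implementation, and `add_months` is a hypothetical helper that assumes the day of month exists in every target month (true for the 14th).

```python
from datetime import datetime


def add_months(ts, months):
    """Shift ts by a number of calendar months, keeping day and time of day."""
    index = ts.year * 12 + (ts.month - 1) + months
    return ts.replace(year=index // 12, month=index % 12 + 1)


def facet_starts(start, count):
    """Return the first `count` bucket start labels for a +1MONTH gap."""
    labels = []
    for i in range(count):
        bucket = add_months(start, i)
        millis = bucket.microsecond // 1000
        labels.append(bucket.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z")
    return labels


start = datetime(2010, 8, 14, 1, 50, 58, 813000)
labels = facet_starts(start, 5)
print(labels)
```

Starting from a start value that has already been truncated to midnight or to the first of the month would give bucket labels on clean boundaries under the same logic.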

Re: NRT in Master- Slave setup, crazy?

2011-08-13 Thread Mark Miller
On Aug 11, 2011, at 9:53 AM, eks dev wrote: Thinking aloud and grateful for sparing .. I need to support high commit rate (low update latency) in a master slave setup and I have a bad feeling about it, even with disabling warmup and stripping everything down that slows down refresh. I