Re: solr 1.4 highlighting issue

2011-09-15 Thread Dmitry Kan
Koji,

This looks strange to me, because I would assume that the highlighter also
applies boolean logic the same way as the query parser. In that way of thinking,
drilling should only be highlighted if ships occurred together with it in the
same document, which wasn't the case in the example.

Dmitry

On Wed, Sep 14, 2011 at 2:20 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

 (11/09/14 15:54), Dmitry Kan wrote:

 Hello list,

 Not sure how many of you are still using solr 1.4 in production, but here
 is
 an issue with highlighting, that we've noticed:

 The query is:

 (drill AND ships) OR rigs


 Excerpt from the highlighting list:

 <arr name="Contents">
 <str>
 Within the fleet of 27 floating <em>rigs</em> (semisubmersibles and
 drillships) are 21 deepwater <em>drilling</em>
 </str>
 </arr>
 </lst>



 Why did solr highlight drilling even though there is no ships in the
 text?


 Dmitry,

 This is expected, even if you use the latest version of Solr.

 You got the document because rigs was a hit in it, but then the Highlighter
 searches for the individual terms of the query in the document again.
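
 For reference, a sketch of the kind of request being discussed, using the query
 and field name from this thread (host, port and core path are assumptions):

   http://localhost:8983/solr/select?q=(drill+AND+ships)+OR+rigs&hl=true&hl.fl=Contents&rows=10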

 koji
 --
 Check out Query Log Visualizer for Apache Solr
 http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
 http://www.rondhuit.com/en/




-- 
Regards,

Dmitry Kan


Re: solr 1.4 highlighting issue

2011-09-15 Thread Dmitry Kan
Hi Mike,

Actually, the example I gave is the document in this case. So there was no
ships, only drilling.

Dmitry

On Wed, Sep 14, 2011 at 1:59 PM, Michael Sokolov soko...@ifactory.com wrote:

 The highlighter gives you snippets of text surrounding words (terms) drawn
 from the query.  The whole document should satisfy the query (ie it probably
 has ships/s somewhere else in it), but each snippet won't generally have all
 the terms.

 -Mike


 On 9/14/2011 2:54 AM, Dmitry Kan wrote:

 Hello list,

 Not sure how many of you are still using solr 1.4 in production, but here
 is
 an issue with highlighting, that we've noticed:

 The query is:

 (drill AND ships) OR rigs


 Excerpt from the highlighting list:

 <arr name="Contents">
 <str>
 Within the fleet of 27 floating <em>rigs</em> (semisubmersibles and
 drillships) are 21 deepwater <em>drilling</em>
 </str>
 </arr>
 </lst>



 Why did solr highlight drilling even though there is no ships in the
 text?

 --
 Regards,

 Dmitry Kan





-- 
Regards,

Dmitry Kan


Re: math with date and modulo

2011-09-15 Thread stockii
okay, thanks a lot.

I thought that it isn't possible to get the month in my case =( I will try
another way.

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/math-with-date-and-modulo-tp3335800p3338207.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Norms - scoring issue

2011-09-15 Thread Ahmet Arslan
It seems that the fieldNorm difference is coming from the field named 'text', and
you didn't include the definition of the 'text' field. Did you omit norms for
that field too?


By the way I see that you have store=true in some places but it should be 
store*d*=true.
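
A sketch of the kind of change being discussed, assuming the 'text' field follows
the stock example schema (the actual definition wasn't posted):

  <!-- omit norms on the catch-all field as well, so its fieldNorm stays 1.0 -->
  <field name="text" type="text" indexed="true" stored="false"
         multiValued="true" omitNorms="true"/>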

--- On Wed, 9/14/11, Adolfo Castro Menna adolfo.castrome...@gmail.com wrote:

 From: Adolfo Castro Menna adolfo.castrome...@gmail.com
 Subject: Norms - scoring issue
 To: solr-user@lucene.apache.org
 Date: Wednesday, September 14, 2011, 11:13 PM
 Hi All,
 
 I hope someone could shed some light on the issue I'm
 facing with solr
 3.1.0. It looks like it's computing diferrent fieldNorm
 values despite my
 configuration that aims to ignore it.
 
    <field name="item_name" type="textgen" indexed="true" store="true"
           omitNorms="true" omitTermFrequencyAndPositions="true" />
    <field name="item_description" type="textTight" indexed="true"
           store="true" omitNorms="true" omitTermFrequencyAndPositions="true" />
    <field name="item_tags" type="text" indexed="true" stored="true"
           multiValued="true" omitNorms="true" omitTermFrequencyAndPositions="true" />
 
 I also have a custom class that extends DefaultSimilarity to override the
 idf method.
 
 Query:
 
 <str name="q">item_name:octopus seafood OR item_description:octopus seafood OR item_tags:octopus seafood</str>
 <str name="sort">score desc,item_ranking desc</str>
 
 The first 2 results are:
 <doc>
   <float name="score">0.5217492</float>
   <str name="item_name">Grilled Octopus</str>
   <arr name="item_tags"><str>Seafood, tapas</str></arr>
 </doc>
 <doc>
   <float name="score">0.49379835</float>
   <str name="item_name">octopus marisco</str>
   <arr name="item_tags"><str>Appetizer, Mexican, Seafood, food</str></arr>
 </doc>
 
 Does anyone know why they get a different score? I'm expecting them to have
 the same scoring because both matched the two search terms.

 I checked the debug information and it seems that the difference involves
 the fieldNorm values.
 
 1) Grilled Octopus
 0.52174926 = (MATCH) product of:
   0.7826238 = (MATCH) sum of:
     0.4472136 = (MATCH) weight(item_name:octopus in 69), product of:
       0.4472136 = queryWeight(item_name:octopus), product of:
         1.0 = idf(docFreq=2, maxDocs=449)
         0.4472136 = queryNorm
       1.0 = (MATCH) fieldWeight(item_name:octopus in 69), product of:
         1.0 = tf(termFreq(item_name:octopus)=1)
         1.0 = idf(docFreq=2, maxDocs=449)
         1.0 = fieldNorm(field=item_name, doc=69)
     0.1118034 = (MATCH) weight(text:seafood in 69), product of:
       0.4472136 = queryWeight(text:seafood), product of:
         1.0 = idf(docFreq=8, maxDocs=449)
         0.4472136 = queryNorm
       0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
         1.0 = tf(termFreq(text:seafood)=1)
         1.0 = idf(docFreq=8, maxDocs=449)
         0.25 = fieldNorm(field=text, doc=69)
     0.1118034 = (MATCH) weight(text:seafood in 69), product of:
       0.4472136 = queryWeight(text:seafood), product of:
         1.0 = idf(docFreq=8, maxDocs=449)
         0.4472136 = queryNorm
       0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
         1.0 = tf(termFreq(text:seafood)=1)
         1.0 = idf(docFreq=8, maxDocs=449)
         0.25 = fieldNorm(field=text, doc=69)
     0.1118034 = (MATCH) weight(text:seafood in 69), product of:
       0.4472136 = queryWeight(text:seafood), product of:
         1.0 = idf(docFreq=8, maxDocs=449)
         0.4472136 = queryNorm
       0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
         1.0 = tf(termFreq(text:seafood)=1)
         1.0 = idf(docFreq=8, maxDocs=449)
         0.25 = fieldNorm(field=text, doc=69)
   0.667 = coord(4/6)

 2) octopus marisco

 0.49379835 = (MATCH) product of:
   0.7406975 = (MATCH) sum of:
     0.4472136 = (MATCH) weight(item_name:octopus in 81), product of:
       0.4472136 = queryWeight(item_name:octopus), product of:
         1.0 = idf(docFreq=2, maxDocs=449)
         0.4472136 = queryNorm
       1.0 = (MATCH) fieldWeight(item_name:octopus in 81), product of:
         1.0 = tf(termFreq(item_name:octopus)=1)
         1.0 = idf(docFreq=2, maxDocs=449)
         1.0 = fieldNorm(field=item_name, doc=81)
     0.09782797 = (MATCH) weight(text:seafood in 81), product of:
       0.4472136 = queryWeight(text:seafood), product of:
         1.0 = idf(docFreq=8, maxDocs=449)
         0.4472136 = queryNorm
       0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of:
         1.0 = tf(termFreq(text:seafood)=1)
         1.0 = idf(docFreq=8, maxDocs=449)
         0.21875 = fieldNorm(field=text, doc=81)
     0.09782797 = (MATCH) weight(text:seafood in 81), product of:
       0.4472136 = queryWeight(text:seafood), product of:
         1.0 = idf(docFreq=8, maxDocs=449)
         0.4472136 = queryNorm
       0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of:
         1.0 = tf(termFreq(text:seafood)=1)
         1.0 = idf(docFreq=8, maxDocs=449)
         0.21875 = fieldNorm(field=text, doc=81)

Re: Out of memory

2011-09-15 Thread Dmitry Kan
Hello,
Since you use caching, you can monitor the evictions parameter on the solr
admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is
non-zero, the cache can be made bigger. queryResultWindowSize is 50 in my case.
Not sure if solr 3.1 supports it, but in 1.4 I have:
<HashDocSet maxSize="1000" loadFactor="0.75"/>
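
For reference, a sketch of where those knobs live in solrconfig.xml (the cache
class and sizes are just illustrative values, not recommendations):

  <queryResultWindowSize>50</queryResultWindowSize>
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>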

Does the OOM happen on update/commit or search?

Dmitry

On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote:

 Thanks Dmitry for the offer to help. I am using some caching in one of the
 cores now. Earlier I was using it on other cores too, but now I have commented
 it out because of frequent OOM; there is also some warming up in one of the
 cores. I have shared the links to my config files for all 4 cores:

 http://haklus.com/crssConfig.xml
 http://haklus.com/rssConfig.xml
 http://haklus.com/twitterConfig.xml
 http://haklus.com/facebookConfig.xml


 Thanks again
 Rohit


 -Original Message-
 From: Dmitry Kan [mailto:dmitry@gmail.com]
 Sent: 14 September 2011 10:23
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory

 Hi,

 OK 64GB fits into one shard quite nicely in our setup. But I have never
 used
 multicore setup. In total you have 79,9 GB. We try to have 70-100GB per
 shard with caching on. Do you do warming up of your index on starting?
 Also,
 there was a setting of pre-populating the cache.

 It could also help, if you can show some parts of your solrconfig file.
 What
 is the solr version you use?

 Regards,
 Dmitry

 On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote:

  Hi Dimtry,
 
  To answer your questions,
 
  -Do you use caching?
  I do user caching, but will disable it and give it a go.
 
  -How big is your index in size on the disk?
  These are the size of the data folder for each of the cores.
  Core1 : 64GB
  Core2 : 6.1GB
  Core3 : 7.9GB
  Core4 : 1.9GB
 
  Will try attaching a jconsole to my solr as suggested to get a better
  picture.
 
  Regards,
  Rohit
 
 
  -Original Message-
  From: Dmitry Kan [mailto:dmitry@gmail.com]
  Sent: 14 September 2011 08:15
  To: solr-user@lucene.apache.org
  Subject: Re: Out of memory
 
  Hi Rohit,
 
  Do you use caching?
  How big is your index in size on the disk?
  What is the stack trace contents?
 
  The OOM problems that we have seen so far were related to the
  index physical size and usage of caching. I don't think we have ever
 found
  the exact cause of these problems, but sharding has helped to keep each
  index relatively small and OOM have gone away.
 
  You can also attach jconsole onto your SOLR via the jmx and monitor the
  memory / cpu usage in a graphical interface. I have also run garbage
  collector manually through jconsole sometimes and it was of a help.
 
  Regards,
  Dmitry
 
  On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote:
 
   Thanks Jaeger.
  
   Actually I am storing twitter streaming data into the core, so the rate
  of
   index is about 12tweets(docs)/second. The same solr contains 3 other
  cores
   but these cores are not very heavy. Now the twitter core has become
 very
   large (77516851) and its taking a long time to query (Mostly facet
  queries
   based on date, string fields).
  
   After sometime about 18-20hr solr goes out of memory, the thread dump
   doesn't show anything. How can I improve this besides adding more ram
  into
   the system.
  
  
  
   Regards,
   Rohit
   Mobile: +91-9901768202
   About Me: http://about.me/rohitg
  
   -Original Message-
   From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
   Sent: 13 September 2011 21:06
   To: solr-user@lucene.apache.org
   Subject: RE: Out of memory
  
   numDocs is not the number of documents in memory.  It is the number of
   documents currently in the index (which is kept on disk).  Same goes
 for
   maxDocs, except that it is a count of all of the documents that have
 ever
   been in the index since it was created or optimized (including deleted
   documents).
  
   Your subject indicates that something is giving you some kind of Out of
   memory error.  We might better be able to help you if you provide more
   information about your exact problem.
  
   JRJ
  
  
   -Original Message-
   From: Rohit [mailto:ro...@in-rev.com]
   Sent: Tuesday, September 13, 2011 2:29 PM
   To: solr-user@lucene.apache.org
   Subject: Out of memory
  
   I have solr running on a machine with 18Gb Ram , with 4 cores. One of
 the
   core is very big containing 77516851 docs, the stats for searcher given
   below
  
  
  
   searcherName : Searcher@5a578998 main
   caching : true
   numDocs : 77516851
   maxDoc : 77518729
   lockFactory=org.apache.lucene.store.NativeFSLockFactory@5a9c5842
   indexVersion : 1308817281798
   openedAt : Tue Sep 13 18:59:52 GMT 2011
   registeredAt : Tue Sep 13 19:00:55 GMT 2011
   warmupTime : 63139
  
  
  
   . Is there a way to reduce the number of docs loaded into
 memory
   for
   this core?
  
   . At any given 

Re: Terms.regex performance issue

2011-09-15 Thread tbarbugli
Hi,
I do have the same problem, i am looking for infix autocomplete, could you
elaborate a bit on your QueryConverter - Suggester solution ?
Thank You!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3338273.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Out of memory

2011-09-15 Thread Rohit
It's happening more during search, and search has become very slow, particularly
on the core with 69GB of index data.

Regards,
Rohit

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: 15 September 2011 07:51
To: solr-user@lucene.apache.org
Subject: Re: Out of memory

Hello,
Since you use caching, you can monitor the eviction parameter on the solr
admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is non
zero, the cache can be made e.g. bigger.
queryResultWindowSize=50 in my case.
Not sure, if solr 3.1 supports, but in 1.4 I have:
HashDocSet maxSize=1000 loadFactor=0.75/

Does the OOM happen on update/commit or search?

Dmitry

On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote:

 Thanks Dmirty for the offer to help, I am using some caching in one of the
 cores not. Earlier I was using on other cores too, but now I have commented
 them out because of frequent OOM, also some warming up in one of the core. I
 have share the links for my config files for all the 4 cores,

 http://haklus.com/crssConfig.xml
 http://haklus.com/rssConfig.xml
 http://haklus.com/twitterConfig.xml
 http://haklus.com/facebookConfig.xml


 Thanks again
 Rohit


 -Original Message-
 From: Dmitry Kan [mailto:dmitry@gmail.com]
 Sent: 14 September 2011 10:23
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory

 Hi,

 OK 64GB fits into one shard quite nicely in our setup. But I have never
 used
 multicore setup. In total you have 79,9 GB. We try to have 70-100GB per
 shard with caching on. Do you do warming up of your index on starting?
 Also,
 there was a setting of pre-populating the cache.

 It could also help, if you can show some parts of your solrconfig file.
 What
 is the solr version you use?

 Regards,
 Dmitry

 On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote:

  Hi Dimtry,
 
  To answer your questions,
 
  -Do you use caching?
  I do user caching, but will disable it and give it a go.
 
  -How big is your index in size on the disk?
  These are the size of the data folder for each of the cores.
  Core1 : 64GB
  Core2 : 6.1GB
  Core3 : 7.9GB
  Core4 : 1.9GB
 
  Will try attaching a jconsole to my solr as suggested to get a better
  picture.
 
  Regards,
  Rohit
 
 
  -Original Message-
  From: Dmitry Kan [mailto:dmitry@gmail.com]
  Sent: 14 September 2011 08:15
  To: solr-user@lucene.apache.org
  Subject: Re: Out of memory
 
  Hi Rohit,
 
  Do you use caching?
  How big is your index in size on the disk?
  What is the stack trace contents?
 
  The OOM problems that we have seen so far were related to the
  index physical size and usage of caching. I don't think we have ever
 found
  the exact cause of these problems, but sharding has helped to keep each
  index relatively small and OOM have gone away.
 
  You can also attach jconsole onto your SOLR via the jmx and monitor the
  memory / cpu usage in a graphical interface. I have also run garbage
  collector manually through jconsole sometimes and it was of a help.
 
  Regards,
  Dmitry
 
  On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote:
 
   Thanks Jaeger.
  
   Actually I am storing twitter streaming data into the core, so the rate
  of
   index is about 12tweets(docs)/second. The same solr contains 3 other
  cores
   but these cores are not very heavy. Now the twitter core has become
 very
   large (77516851) and its taking a long time to query (Mostly facet
  queries
   based on date, string fields).
  
   After sometime about 18-20hr solr goes out of memory, the thread dump
   doesn't show anything. How can I improve this besides adding more ram
  into
   the system.
  
  
  
   Regards,
   Rohit
   Mobile: +91-9901768202
   About Me: http://about.me/rohitg
  
   -Original Message-
   From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
   Sent: 13 September 2011 21:06
   To: solr-user@lucene.apache.org
   Subject: RE: Out of memory
  
   numDocs is not the number of documents in memory.  It is the number of
   documents currently in the index (which is kept on disk).  Same goes
 for
   maxDocs, except that it is a count of all of the documents that have
 ever
   been in the index since it was created or optimized (including deleted
   documents).
  
   Your subject indicates that something is giving you some kind of Out of
   memory error.  We might better be able to help you if you provide more
   information about your exact problem.
  
   JRJ
  
  
   -Original Message-
   From: Rohit [mailto:ro...@in-rev.com]
   Sent: Tuesday, September 13, 2011 2:29 PM
   To: solr-user@lucene.apache.org
   Subject: Out of memory
  
   I have solr running on a machine with 18Gb Ram , with 4 cores. One of
 the
   core is very big containing 77516851 docs, the stats for searcher given
   below
  
  
  
   searcherName : Searcher@5a578998 main
   caching : true
   numDocs : 77516851
   maxDoc : 77518729
   

Re: why we need the index information in a database ?

2011-09-15 Thread Gora Mohanty
On Thu, Sep 15, 2011 at 2:53 PM, kiran.bodigam kiran.bodi...@gmail.com wrote:
 why we need the index information in a database is because it is clusterable.
 In other words, we may have/need more than one instance of the SOLR engine
 running.
[...]

Not sure if you are after multiple instances that replicate between
each other, or a solution that scales on demand. Both are possible:
Please see, e.g.,
  http://wiki.apache.org/solr/SolrReplication
  http://wiki.apache.org/solr/SolrCloud
If you could explain details of what you want, people might be
better able to advise you.

As people have pointed out, putting Solr's index into a database
makes no sense, and will almost certainly never be officially
supported.

Regards,
Gora


Re: Out of memory

2011-09-15 Thread Dmitry Kan
If you have many users you could scale vertically, i.e. do replication. But
before that you could do sharding, for example by routing documents to shards
based on a hash function. Let's say you split the 69GB into two shards first and
experiment with that.
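
A sketch of how the two shards would then be queried together with Solr's
distributed search (host names and core names here are made up for illustration):

  http://host1:8983/solr/twitter1/select?q=*:*&shards=host1:8983/solr/twitter1,host2:8983/solr/twitter2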

Regards,
Dmitry

On Thu, Sep 15, 2011 at 12:22 PM, Rohit ro...@in-rev.com wrote:

 It's happening more in search and search has become very slow particularly
 on the core with 69GB index data.

 Regards,
 Rohit

 -Original Message-
 From: Dmitry Kan [mailto:dmitry@gmail.com]
 Sent: 15 September 2011 07:51
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory

 Hello,
 Since you use caching, you can monitor the eviction parameter on the solr
 admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is
 non
 zero, the cache can be made e.g. bigger.
 queryResultWindowSize=50 in my case.
 Not sure, if solr 3.1 supports, but in 1.4 I have:
 HashDocSet maxSize=1000 loadFactor=0.75/

 Does the OOM happen on update/commit or search?

 Dmitry

 On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote:

  Thanks Dmirty for the offer to help, I am using some caching in one of
 the
  cores not. Earlier I was using on other cores too, but now I have
 commented
  them out because of frequent OOM, also some warming up in one of the
 core. I
  have share the links for my config files for all the 4 cores,
 
  http://haklus.com/crssConfig.xml
  http://haklus.com/rssConfig.xml
  http://haklus.com/twitterConfig.xml
  http://haklus.com/facebookConfig.xml
 
 
  Thanks again
  Rohit
 
 
  -Original Message-
  From: Dmitry Kan [mailto:dmitry@gmail.com]
  Sent: 14 September 2011 10:23
  To: solr-user@lucene.apache.org
  Subject: Re: Out of memory
 
  Hi,
 
  OK 64GB fits into one shard quite nicely in our setup. But I have never
  used
  multicore setup. In total you have 79,9 GB. We try to have 70-100GB per
  shard with caching on. Do you do warming up of your index on starting?
  Also,
  there was a setting of pre-populating the cache.
 
  It could also help, if you can show some parts of your solrconfig file.
  What
  is the solr version you use?
 
  Regards,
  Dmitry
 
  On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote:
 
   Hi Dimtry,
  
   To answer your questions,
  
   -Do you use caching?
   I do user caching, but will disable it and give it a go.
  
   -How big is your index in size on the disk?
   These are the size of the data folder for each of the cores.
   Core1 : 64GB
   Core2 : 6.1GB
   Core3 : 7.9GB
   Core4 : 1.9GB
  
   Will try attaching a jconsole to my solr as suggested to get a better
   picture.
  
   Regards,
   Rohit
  
  
   -Original Message-
   From: Dmitry Kan [mailto:dmitry@gmail.com]
   Sent: 14 September 2011 08:15
   To: solr-user@lucene.apache.org
   Subject: Re: Out of memory
  
   Hi Rohit,
  
   Do you use caching?
   How big is your index in size on the disk?
   What is the stack trace contents?
  
   The OOM problems that we have seen so far were related to the
   index physical size and usage of caching. I don't think we have ever
  found
   the exact cause of these problems, but sharding has helped to keep each
   index relatively small and OOM have gone away.
  
   You can also attach jconsole onto your SOLR via the jmx and monitor the
   memory / cpu usage in a graphical interface. I have also run garbage
   collector manually through jconsole sometimes and it was of a help.
  
   Regards,
   Dmitry
  
   On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote:
  
Thanks Jaeger.
   
Actually I am storing twitter streaming data into the core, so the
 rate
   of
index is about 12tweets(docs)/second. The same solr contains 3 other
   cores
but these cores are not very heavy. Now the twitter core has become
  very
large (77516851) and its taking a long time to query (Mostly facet
   queries
based on date, string fields).
   
After sometime about 18-20hr solr goes out of memory, the thread dump
doesn't show anything. How can I improve this besides adding more ram
   into
the system.
   
   
   
Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
   
-Original Message-
From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
Sent: 13 September 2011 21:06
To: solr-user@lucene.apache.org
Subject: RE: Out of memory
   
numDocs is not the number of documents in memory.  It is the number
 of
documents currently in the index (which is kept on disk).  Same goes
  for
maxDocs, except that it is a count of all of the documents that have
  ever
been in the index since it was created or optimized (including
 deleted
documents).
   
Your subject indicates that something is giving you some kind of Out
 of
memory error.  We might better be able to help you if you provide
 more
information about your exact problem.
   
JRJ
   
   
-Original Message-

Re: Count rows with tokens

2011-09-15 Thread tom135
Facet indexing is a good solution for me :)

Thanks for your help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Count-rows-with-tokens-tp3274643p3338556.html
Sent from the Solr - User mailing list archive at Nabble.com.


can we share the same index directory for multiple cores?

2011-09-15 Thread kiran.bodigam
If we implement the multi-core functionality in Solr, is there any possibility
that the same index information is shared by two different cores (redundancy)?
Can we share the same index directory for multiple cores? If I query it from the
admin page, which core will respond, given that the suggestion is to query each
core separately (http://localhost:8983/solr/core0/select?q=*:*)? I don't want to
do that. I would like to know how the multi-core functionality works.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/can-we-share-the-same-index-directory-for-multiple-cores-tp3338571p3338571.html
Sent from the Solr - User mailing list archive at Nabble.com.


Distinct elements in a field

2011-09-15 Thread swiss knife
Simple question: I want to know how many distinct elements I have in a field,
among the documents that match a query. Do you know if there's a way to do it
today in 3.4?

 I saw SOLR-1814 and SOLR-2242.

 SOLR-1814 seems fairly easy to use. What do you think? Thank you


Delete documents with empty fields

2011-09-15 Thread Massimo Schiavon

I want to delete all documents with an empty title field.
If I run the query -title:[* TO *] I obtain the correct list of
documents, but when I submit the delete command to Solr:


curl http://localhost:8080/solr/web/update\?commit=true -H \
'Content-Type: text/xml' --data-binary \
'<delete><query>-title:[* TO *]</query></delete>'

none of the documents were deleted.

After a bit of debugging I have noted that the query was internally 
rewritten by org.apache.lucene.search.Searcher.createNormalizedWeight to 
an empty query.


Is it a bug, or is there another way to do this operation? (Or is there no way?)



Regards

Massimo


Re: Solandra - select query error

2011-09-15 Thread tom135
Hi Jake,

I have reproduced an example of my error (commit release 3408a30):

1. I have used schema.xml from reuters-demo, with my fields definition:
.
<fields>
   <field name="id" type="long" indexed="true" stored="true" required="true" />
   <field name="text" type="text" indexed="true" stored="true" termPositions="true"/>
   <field name="doma_type" type="long" indexed="true" stored="true" required="true" />
   <field name="sentiment_type" type="long" indexed="true" stored="true" required="true" />
   <field name="date_check" type="long" indexed="true" stored="true" required="true" />
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
.

2. I have populated 1 items with two iterations (5000) to index
sampleIndex.sub

3. Then I execute many selects against the above index sampleIndex.sub with many
query combinations:
QueryResponse r = client.query(combination query);

- doma_type:(2) AND sentiment_type:(1) AND text:(piwo nie może) --
combination query - ERROR
- doma_type:(2 1) AND sentiment_type:(1) AND text:(piwo nie może)
- doma_type:(3 2 1) AND sentiment_type:(1) AND text:(piwo nie może)
- doma_type:(3 2 1) AND sentiment_type:(1) AND text:(może)
- doma_type:(3 2 1) AND sentiment_type:(1) AND text:(piwo nie)
- etc. (all combinations of numbers 1 2 3 and words piwo nie może)

4. In the results, I received an error for one of the combination queries above.
The failing combination query is not repeatable and the error does not always
occur. If it does not occur, try the above steps again (the selects should be
performed immediately after indexing/writing the data).

I may have a bad configuration for this situation (I have standard
configuration).

MY CONSOLE ERROR:
org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at com.its.bt.solandra.dao.PostDao.countPosts(PostDao.java:383)
at com.its.bt.solandra.dao.PostDao.countPostsByTokens(PostDao.java:338)
at com.its.bt.solandra.dao.ProjectDao.main(ProjectDao.java:42)
Caused by: org.apache.solr.common.SolrException: 4 
java.lang.ArrayIndexOutOfBoundsException: 4 at
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:310)  at
org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:135) 
at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:182) 
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:309) 
at
org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.collect(TopScoreDocCollector.java:47)
 
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:281) 
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:320)   at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1178)
 
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1066)
 
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:358) 
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:258)
 
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
 
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) 
at solandra.SolandraDispatchFilter.execute(SolandraDispatchFilter.java:171) 
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
 
at solandra.SolandraDispatchFilter.doFilter(SolandraDispatchFilter.java:137) 
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) 
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) 
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) 
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) 
at org.mortbay.jetty.Server.handle(Server.ja

4  java.lang.ArrayIndexOutOfBoundsException: 4  at
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:310)  at
org.apache.lucene.search.ConjunctionScorer.score(ConjunctionScorer.java:135) 
at org.apache.lucene.search.BooleanScorer2$2.score(BooleanScorer2.java:182) 
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:309) 
at
org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.collect(TopScoreDocCollector.java:47)
 
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:281) 
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526)
at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:320)   at

Re: Delete documents with empty fields

2011-09-15 Thread Ahmet Arslan
 I want to delete all documents with
 empty title field.
 If i run the query -title:[* TO *] I obtain the correct
 list of documents but when I submit to solr the delete
 command:
 
 curl http://localhost:8080/solr/web/update\?commit=true -H
 'Content-Type: text/xml' --data-binary \
 '<delete><query>-title:[* TO *]</query></delete>'
 
 none of the documents were deleted.
 
 After a bit of debugging I have noted that the query was
 internally rewritten by
 org.apache.lucene.search.Searcher.createNormalizedWeight to
 an empty query.
 
 It is a bug or there is another way to do this operation?
 (or there is no way?)

Not sure but '<delete><query>+*:* -title:[* TO *]</query></delete>' may do the
trick.
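
For reference, a minimal sketch of that suggestion plugged into the curl command
from the original message (same URL and core; untested here):

  curl http://localhost:8080/solr/web/update\?commit=true -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>+*:* -title:[* TO *]</query></delete>'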


Re: indexing data from rich documents - Tika with solr3.1

2011-09-15 Thread Erik Hatcher
Maybe this quick script will get you running?


http://www.lucidimagination.com/blog/2011/08/31/indexing-rich-files-into-solr-quickly-and-easily/
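
If the script in that post isn't an option, a bare-bones sketch of posting one
PDF straight to the ExtractingRequestHandler (paths and the literal.id value are
assumptions; the /update/extract handler must be enabled in solrconfig.xml):

  curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' \
    -F 'myfile=@/path/to/file.pdf'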


On Sep 15, 2011, at 00:44 , scorpking wrote:

 Hi Erick Erickson,
 We have files in many formats (doc, ppt, pdf, ...); their purpose is to let us
 search the detailed educational content in those files. Because I am new to
 Solr, maybe I don't understand Apache Tika in enough depth. At the moment I
 can't index PDF files from HTTP, although a single file works fine. Thanks for
 your attention.
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/indexing-data-from-rich-documents-Tika-with-solr3-1-tp3322555p3337963.html
 Sent from the Solr - User mailing list archive at Nabble.com.



How to write core's name in log

2011-09-15 Thread Joan
Hi,

I have multiple cores in Solr and I want to write the core name in the log
through log4j.

I've found in SolrException a method called log(Logger log, Throwable e), but
when it builds an Exception it doesn't have the core's name.

The Exception message is built in the toStr() method of the SolrException class,
so I want to include the core's name in that message.

I'm thinking of adding an MDC variable holding the name of the core and then
using it in the log4j configuration, e.g. %X{core} in the ConversionPattern.
The idea is that when Solr receives a request I'll set this variable to the
name of the core.

But I don't know if it's a good idea or not.

Or does a solution already exist for adding the name of the core to the log?
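
A minimal sketch of that MDC idea (the hook point inside Solr and the log4j
pattern are assumptions; only org.slf4j.MDC and the %X{} pattern are standard):

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;
  import org.slf4j.MDC;

  public class CoreMdcExample {
      private static final Logger log = LoggerFactory.getLogger(CoreMdcExample.class);

      public static void main(String[] args) {
          // In Solr this would run at the start of request handling,
          // once the target core is known (hypothetical hook point).
          String coreName = "core0";
          MDC.put("core", coreName);        // log4j can render it as %X{core}
          try {
              log.info("handling request"); // e.g. pattern "%d [%X{core}] %m%n"
          } finally {
              MDC.remove("core");           // don't leak the value to the next request on this thread
          }
      }
  }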

Thanks

Joan


RE: Out of memory

2011-09-15 Thread Rohit
Thanks Dmitry, let me look into sharding concepts.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg


-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: 15 September 2011 10:15
To: solr-user@lucene.apache.org
Subject: Re: Out of memory

If you have many users you could scale vertically, i.e. do replication. Buf
before that you could do sharding, for example by indexing entries based on
a hash function. Let's say split 69GB to two shards first and experiment
with it.

Regards,
Dmitry

On Thu, Sep 15, 2011 at 12:22 PM, Rohit ro...@in-rev.com wrote:

 It's happening more in search and search has become very slow particularly
 on the core with 69GB index data.

 Regards,
 Rohit

 -Original Message-
 From: Dmitry Kan [mailto:dmitry@gmail.com]
 Sent: 15 September 2011 07:51
 To: solr-user@lucene.apache.org
 Subject: Re: Out of memory

 Hello,
 Since you use caching, you can monitor the eviction parameter on the solr
 admin page (http://localhost:port/solr/admin/stats.jsp#cache). If it is
 non
 zero, the cache can be made e.g. bigger.
 queryResultWindowSize=50 in my case.
 Not sure, if solr 3.1 supports, but in 1.4 I have:
 HashDocSet maxSize=1000 loadFactor=0.75/

 Does the OOM happen on update/commit or search?

 Dmitry

 On Wed, Sep 14, 2011 at 2:47 PM, Rohit ro...@in-rev.com wrote:

  Thanks Dmirty for the offer to help, I am using some caching in one of
 the
  cores not. Earlier I was using on other cores too, but now I have
 commented
  them out because of frequent OOM, also some warming up in one of the
 core. I
  have share the links for my config files for all the 4 cores,
 
  http://haklus.com/crssConfig.xml
  http://haklus.com/rssConfig.xml
  http://haklus.com/twitterConfig.xml
  http://haklus.com/facebookConfig.xml
 
 
  Thanks again
  Rohit
 
 
  -Original Message-
  From: Dmitry Kan [mailto:dmitry@gmail.com]
  Sent: 14 September 2011 10:23
  To: solr-user@lucene.apache.org
  Subject: Re: Out of memory
 
  Hi,
 
  OK 64GB fits into one shard quite nicely in our setup. But I have never
  used
  multicore setup. In total you have 79,9 GB. We try to have 70-100GB per
  shard with caching on. Do you do warming up of your index on starting?
  Also,
  there was a setting of pre-populating the cache.
 
  It could also help, if you can show some parts of your solrconfig file.
  What
  is the solr version you use?
 
  Regards,
  Dmitry
 
  On Wed, Sep 14, 2011 at 11:38 AM, Rohit ro...@in-rev.com wrote:
 
   Hi Dimtry,
  
   To answer your questions,
  
   -Do you use caching?
   I do user caching, but will disable it and give it a go.
  
   -How big is your index in size on the disk?
   These are the size of the data folder for each of the cores.
   Core1 : 64GB
   Core2 : 6.1GB
   Core3 : 7.9GB
   Core4 : 1.9GB
  
   Will try attaching a jconsole to my solr as suggested to get a better
   picture.
  
   Regards,
   Rohit
  
  
   -Original Message-
   From: Dmitry Kan [mailto:dmitry@gmail.com]
   Sent: 14 September 2011 08:15
   To: solr-user@lucene.apache.org
   Subject: Re: Out of memory
  
   Hi Rohit,
  
   Do you use caching?
   How big is your index in size on the disk?
   What is the stack trace contents?
  
   The OOM problems that we have seen so far were related to the
   index physical size and usage of caching. I don't think we have ever
  found
   the exact cause of these problems, but sharding has helped to keep each
   index relatively small and OOM have gone away.
  
   You can also attach jconsole onto your SOLR via the jmx and monitor the
   memory / cpu usage in a graphical interface. I have also run garbage
   collector manually through jconsole sometimes and it was of a help.
  
   Regards,
   Dmitry
  
   On Wed, Sep 14, 2011 at 9:10 AM, Rohit ro...@in-rev.com wrote:
  
Thanks Jaeger.
   
Actually I am storing twitter streaming data into the core, so the
 rate
   of
index is about 12tweets(docs)/second. The same solr contains 3 other
   cores
but these cores are not very heavy. Now the twitter core has become
  very
large (77516851) and its taking a long time to query (Mostly facet
   queries
based on date, string fields).
   
After sometime about 18-20hr solr goes out of memory, the thread dump
doesn't show anything. How can I improve this besides adding more ram
   into
the system.
   
   
   
Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg
   
-Original Message-
From: Jaeger, Jay - DOT [mailto:jay.jae...@dot.wi.gov]
Sent: 13 September 2011 21:06
To: solr-user@lucene.apache.org
Subject: RE: Out of memory
   
numDocs is not the number of documents in memory.  It is the number
 of
documents currently in the index (which is kept on disk).  Same goes
  for
maxDocs, except that it is a count of all of the documents that have
  ever
been in the index since it was created or optimized 

Replication and ExternalFileField

2011-09-15 Thread Per Osbeck
Hi all,

I'm trying to find some good information regarding replication, especially for 
the ExternalFileField.

As I understand it:
 - the external files must be in the data dir.
 - replication only replicates data/indexes and possibly confFiles from the
conf dir.

Does anyone have suggestions or ideas on how this should work?

Best regards,
Per Osbeck



Re: Replication and ExternalFileField

2011-09-15 Thread Markus Jelsma
Perhaps a symlink will do the trick.

On Thursday 15 September 2011 14:04:47 Per Osbeck wrote:
 Hi all,
 
 I'm trying to find some good information regarding replication, especially
 for the ExternalFileField.
 
 As I understand it;
  - the external files must be in data dir.
  - replication only replicates data/indexes and possibly confFiles from the
 conf dir.
 
 Does anyone have suggestions or ideas on how this should would work?
 
 Best regards,
 Per Osbeck

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Replication and ExternalFileField

2011-09-15 Thread Per Osbeck
Probably would have worked on *nix but unfortunately running Windows.

Best regards,
Per


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: den 15 september 2011 14:07
To: solr-user@lucene.apache.org
Subject: Re: Replication and ExternalFileField

Perhaps a symlink will do the trick.
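
For example, something along these lines (paths are made up; the idea is just to
make the shared external_* file appear inside each core's data dir):

  ln -s /shared/external_myfield /var/solr/core0/data/external_myfield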

On Thursday 15 September 2011 14:04:47 Per Osbeck wrote:
 Hi all,
 
 I'm trying to find some good information regarding replication, 
 especially for the ExternalFileField.
 
 As I understand it;
  - the external files must be in data dir.
  - replication only replicates data/indexes and possibly confFiles 
 from the conf dir.
 
 Does anyone have suggestions or ideas on how this should would work?
 
 Best regards,
 Per Osbeck

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Delete documents with empty fields

2011-09-15 Thread Massimo Schiavon

On 15/09/2011 13:01, Ahmet Arslan wrote:

+*:* -title:[* TO *]


Worked fine.
Thanks a lot!


Massimo


Re: Index not getting refreshed

2011-09-15 Thread Mike Sokolov
Is it possible you have two solr instances running off the same index 
folder?  This was a mistake I stumbled into early on - I was writing 
with one, and reading with the other, so I didn't see updates.


-Mike

On 09/15/2011 12:37 AM, Pawan Darira wrote:

 I am committing but not doing replication now. My sort order also includes the
 last login timestamp. The new profiles are being reflected in my Solr admin
 db, but they are not listed on my website.

On Thu, Sep 15, 2011 at 4:25 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:

   

: I am using Solr 3.2 on a live website. i get live user's data of about
2000
: per day. I do an incremental index every 8 hours. but my search results
: always show the same result with same sorting order. when i check the
same

Are you commiting?

Are you using replication?

Are you using a sort order that might not make it obvious that the new
docs are actaully there? (ie: sort=timestamp asc)


-Hoss

 
   


Re: Terms.regex performance issue

2011-09-15 Thread O. Klein
Read http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html for more
info about the QueryConverter. IMO Suggester should make it easier to choose
between QueryConverters.

As for the infix, the WIKI says it's a planned feature, but the Suggester hasn't
been worked on for a couple of months. So I guess we will have to wait :)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Terms-regex-performance-issue-tp3268994p3338899.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Norms - scoring issue

2011-09-15 Thread Adolfo Castro Menna
Hi Ahmet,

You're right. It was related to the text field, which is the defaultSearchField.
I also added omitNorms=true in the fieldType definition, and it's now
working as expected.

Thanks,
Adolfo.


Re: OOM issue

2011-09-15 Thread abhijit bashetti
Hi Erick,

Thanks for the reply. It is very useful for me.

For point 1: I do need 10 cores and the count will keep increasing in future.
I have documents that belong to different workspaces, so 1 workspace = 1 core;
I can't go with one core. Currently I have 10 cores, but in future the count may
go beyond 40.

For point 2: Currently I have not given it any thought, but yes, I think in
future I may have to go for the master/slave setup.

For point 3: The current cache sizes for the document cache, filter cache and
query cache are 512 each, and ramBufferSizeMB is 512M. Shall I reduce that to 128M?

For point 4: I didn't get why I should use SolrJ with Tika. Do you mean sending
the new/updated documents to Tika for reindexing? I am already doing that using
data-config: I have written the query in data-config in such a way that it takes
the path of the updated/new documents.

Thanks in advance!

Regards,
Abhijit


Multiple webapps will not help you; they still use the same underlying
memory. In fact, it'll make matters worse since they won't share
resources.

So questions become:
1 Why do you have 10 cores? Putting 10 cores on the same machine
doesn't really do much. It can make lots of sense to put 10 cores on the
same machine for *indexing*, then replicate them out. But putting
10 cores on one machine in hopes of making better use of memory
isn't useful. It may be useful to just go to one core.

2 Indexing, reindexing and searching on a single machine is requiring a
lot from that machine. Really you should consider having a master/slave
setup.

3 But assuming more hardware of any sort isn't in the cards, sure. reduce
your cache sizes. Look at ramBufferSizeMB and make it small.

4 Consider indexing with Tika via SolrJ and only sending the finished
document to Solr.

Best
Erick
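
A sketch of the solrconfig.xml settings point 3 refers to (the element names are
standard; the sizes here are arbitrary examples, not recommendations):

  <ramBufferSizeMB>128</ramBufferSizeMB>
  <filterCache class="solr.LRUCache" size="128" initialSize="128" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="128" initialSize="128" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="128" initialSize="128"/>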

On Mon, Sep 12, 2011 at 5:42 AM, Manish Bafna manish.bafna...@gmail.com wrote:
 Number of cache is definitely going to reduce heap usage.

 Can you run those xlsx file separately with Tika and see if you are getting
 OOM issue.


 On Mon, Sep 12, 2011 at 3:09 PM, abhijit bashetti abhijitbashe...@gmail.com
 wrote:

 I am facing the OOM issue.

  Other than increasing the RAM, can we change some other parameters to
  avoid the OOM issue.


 such as minimizing the filter cache size , document cache size etc.

 Can you suggest me some other option to avoid the OOM issue?


 Thanks in advance!


 Regards,

 Abhijit


Re: Performance troubles with solr

2011-09-15 Thread Yusuf Karakaya
Thank you all for your fast replies.
Changing photo_id:* to a boolean has_photo field via a transformer when
importing the data *fixed my problems*, reducing query times to *~30 ms*.
I'll try to optimize further following your advice on filter query usage and
the int -> tint transform (I will read up on it first).
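
For reference, a sketch of the kind of request this enables, combining the new
boolean field with the filter-query advice quoted below (field names as in the
thread; everything else is an assumption):

  q=*:*&fq=has_photo:true&fq=gender:true&fq=online:false&fq=country:MALAWI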


On Thu, Sep 15, 2011 at 1:31 AM, Chris Hostetter
hossman_luc...@fucit.orgwrote:


 : q=photo_id:* AND gender:true AND country:MALAWI AND online:false

 photo_id:* does not mean what you probably think it means.  you most
 likely want photo_id:[* TO *] given your current schema, but i would
 recommend adding a new has_photo boolean field and using that instead.

  that alone should explain a big part of why those queries would be slow.

 you didn't describe how your q param varies in your test queries (just
 your fq).  I'm assuming gender and online can vary, and that you
 sometimes don't use the photo_id clauses, and that the country clause
 can vary, but that these clauses are always all mandatory.

 in which case i would suggest using fq for all of them individually, and
  leaving your q param as *:* (unless you sometimes sort on the actual
  solr score, in which case leave it as whatever part of the query you
  actually want to contribute to the score)

  Lastly: I don't remember off the top of my head how int and tint are
 defined in the example solrconfig files, but you should consider your
 usage of them carefully -- particularly with the precisionStep and which
 fields you do range queries on.



 -Hoss



Re: Schema fieldType y-m-d ?!?!

2011-09-15 Thread stockii
thx =)

I think I will save this as a string if ranges really work =)

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Schema-fieldType-y-m-d-tp3335359p3339160.html
Sent from the Solr - User mailing list archive at Nabble.com.


Multiple shards on same machine find matches but return 0 results.

2011-09-15 Thread Aliya Virani
Hi,

Recently we have been trying to scale up our Solandra setup to make use of a
more powerful server. To improve query speeds we tried reducing the index
size, and thus increasing the number of shards on a single machine. While we
had no trouble searching and return results when we had a single shard, with
multiple shards we are getting an interesting bug where the search is able
to find the matching documents and returns an accurate count, but returns 0
documents. What could be causing this problem? Are we missing an obvious
parameter? While using multiple shards, if I set isShard=true, we do get
back results with total number found from only one shard.  We tried hitting
each of the cores directly by using the setParam from SolrJ, but are not
getting any results back.  We found the names of the core from the logs
(ie.10.1.10.200:8983/solandra/df~1).


To debug we have set up a test environment that has the latest release of
Solandra and uses the default settings. In this setting with just two shards
we are seeing the same issue. We have also tested with different size shards
ranging from 256 to 4194304. In all cases as soon as we have more than 1
shard Solandra stops returning results even though it is clear the results
were found. Below is some of the log information.

Server specs:
8 cores
32 GB of memory (though we are only allocating 16GB for Solandra)

Using the default settings in Solandra Properties, we added in 2,000,000
documents to ensure there were two shards on the same machine.

For a query that has not been cached:

DEBUG 10:28:55,004 core: df
DEBUG 10:28:55,005 Adding shard(df): 10.1.10.200:8983/solandra/df~0
DEBUG 10:28:55,005 Adding shard(df): 10.1.10.200:8983/solandra/df~1
DEBUG 10:28:55,014 Fetching 0 Docs
INFO 10:28:55,015 [df] webapp=/solandra path=/select
params={fl=key,score&start=0&q=province:az&isShard=true&wt=javabin&fsv=true&rows=10&version=2}
hits=0 status=0 QTime=3
INFO 10:28:55,821 GC for ParNew: 258 ms, 586012000 reclaimed leaving
2122387984 used; max is 16955473920
DEBUG 10:28:58,034 Fetching 10 Docs
DEBUG 10:28:58,035 Going to bulk load 10 documents
DEBUG 10:28:58,099 Document read took: 63ms
INFO 10:28:58,099 [df] webapp=/solandra path=/select
params={fl=key,score&start=0&q=province:az&isShard=true&wt=javabin&fsv=true&rows=10&version=2}
hits=99470 status=0 QTime=3087
DEBUG 10:28:58,101 Document read took: 1ms
DEBUG 10:28:58,102 Document read took: 1ms
DEBUG 10:28:58,104 Document read took: 1ms
DEBUG 10:28:58,105 Document read took: 1ms
DEBUG 10:28:58,107 Document read took: 2ms
DEBUG 10:28:58,108 Document read took: 1ms
DEBUG 10:28:58,109 Document read took: 1ms
DEBUG 10:28:58,110 Document read took: 1ms
DEBUG 10:28:58,112 Document read took: 1ms
DEBUG 10:28:58,113 Document read took: 1ms
DEBUG 10:28:58,118 Fetching 0 Docs
INFO 10:28:58,118 [df] webapp=/solandra path=/select
params={isShard=true&wt=javabin&q=province:az&ids=[us/az/yuma/1152s4thave],[us/az/tempe/208sriverdr],[us/az/mundspark/475pinewoodblvd],[us/az/phoenix/2338wstellaln],[us/az/tucson/3341wwildwooddr],[us/az/surprise/15128wbellrd],[us/az/phoenix/3222egeorgiaave],[us/az/lakehavasucity/2250catamarandr],[us/az/huachucacity/264shuachucablvd],[us/az/tucson/6161sparkave]&version=2}
status=0 QTime=1
INFO 10:28:58,119 [df] webapp=/solandra path=/select
params={wt=javabin&q=province:az&version=2} status=0 QTime=3115

For a query that has been cached:

DEBUG 10:27:36,350 core: df
INFO 10:27:36,351 ShardInfo for df has expired
INFO 10:27:36,353 Found reserved
shard1(106758077800188110322537822484278066430):178410 TO 180224
DEBUG 10:27:36,353 Adding shard(df): 10.1.10.200:8983/solandra/df~0
DEBUG 10:27:36,353 Adding shard(df): 10.1.10.200:8983/solandra/df~1
DEBUG 10:27:36,359 Fetching 0 Docs
INFO 10:27:36,360 [df] webapp=/solandra path=/select
params={fl=key,score&start=0&q=province:ak&isShard=true&wt=javabin&fsv=true&rows=10&version=2}
hits=0 status=0 QTime=2
DEBUG 10:27:36,362 Fetching 10 Docs
DEBUG 10:27:36,363 Found doc in cache
INFO 10:27:36,363 [df] webapp=/solandra path=/select
params={fl=key,score&start=0&q=province:ak&isShard=true&wt=javabin&fsv=true&rows=10&version=2}
hits=14707 status=0 QTime=5
DEBUG 10:27:36,363 Found doc in cache
DEBUG 10:27:36,363 Found doc in cache
DEBUG 10:27:36,363 Found doc in cache
DEBUG 10:27:36,364 Found doc in cache
DEBUG 10:27:36,364 Found doc in cache
DEBUG 10:27:36,364 Found doc in cache
DEBUG 10:27:36,364 Found doc in cache
DEBUG 10:27:36,364 Found doc in cache
DEBUG 10:27:36,365 Found doc in cache
DEBUG 10:27:36,365 Found doc in cache
DEBUG 10:27:36,369 Fetching 0 Docs
INFO 10:27:36,369 [df] webapp=/solandra path=/select
params={isShard=true&wt=javabin&q=province:ak&ids=[us/ak/fairbanks/1483ballainerd],[us/ak/anchorage/4451etudorrd],[us/ak/anchorage/600cordovast],[us/ak/anchorage/6048e6thave],[us/ak/anchorage/940tyonekdr],[us/ak/fairbanks/3800universityaves],[us/ak/kenai/47189sherwoodcir],[us/ak/anchorage/12801oldsewardhwy],[us/ak/anchorage/8400raintreecir],[us/ak/juneau/9150skywoodln]&version=2}
status=0 

Re: glassfish, solrconfig.xml and SolrException: Error loading DataImportHandler

2011-09-15 Thread Xue-Feng Yang
Thanks for telling me about this issue. However, I would think this is a bug. ^=^





From: Chris Hostetter hossman_luc...@fucit.org
To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Xue-Feng Yang 
just4l...@yahoo.com
Sent: Wednesday, September 14, 2011 6:19:24 PM
Subject: Re: glassfish, solrconfig.xml  and SolrException: Error loading 
DataImportHandler


: References: 41dfe0136ddf091e98d45dea9f0da1ab@localhost
:  cab_8yd9obtkvkdktqpfnuzmey-afbzajyvgahh58+mccgiq...@mail.gmail.com
: Message-ID: 1316011545.626.yahoomail...@web110411.mail.gq1.yahoo.com
: Subject: glassfish, solrconfig.xml  and SolrException: Error loading
:  DataImportHandler
: In-Reply-To:
:     cab_8yd9obtkvkdktqpfnuzmey-afbzajyvgahh58+mccgiq...@mail.gmail.com

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is hidden in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


-Hoss

location of solr folder when deploy to servlet container

2011-09-15 Thread Kiwi de coder
hi,

How do I configure the solr home folder to point to a specific directory when
deploying to a servlet container?

regards,
kiwi


Re: location of solr folder when deploy to servlet container

2011-09-15 Thread Markus Jelsma
In Tomcat you can set an environment var in Solr's context and set your home 
directory:

  <Environment name="solr/home" type="java.lang.String" value="/opt/solr/" />
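
For context, a sketch of the Tomcat context fragment that element typically sits
in (the file location and docBase path are assumptions):

  <!-- e.g. $CATALINA_HOME/conf/Catalina/localhost/solr.xml -->
  <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" value="/opt/solr/" override="true"/>
  </Context>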


 hi,
 
 how do i configure the solr folder to specific directory when deploy to
 servlet container.
 
 regards,
 kiwi


Re: how would I use the new join feature given my schema.

2011-09-15 Thread Jason Toy
Does anyone know the query I would use to get the join to work? I'm unable to
get it to work.
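
For reference, a sketch of the kind of query being attempted, assuming the trunk
{!join} parser (SOLR-2272) and the field names from the quoted message below;
untested, so the exact syntax may need adjusting:

  q=title_text:cool AND _query_:"{!join from=user_id_i to=user_id_i}class_name:User AND description_text:solr"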

On Wed, Sep 14, 2011 at 10:49 AM, Jason Toy jason...@gmail.com wrote:

 I've been reading the information on the new join feature and am not quite
 sure how I would use it given my schema structure. I have User docs and
 BlogPost docs and I want to return all BlogPosts that match the fulltext
 title cool that belong to Users that match the description solr.

 Here are the 2 docs I have:


 <?xml version="1.0" encoding="UTF-8"?>
 <add>
   <doc>
     <field name="class_name">User</field>
     <field name="login_s">jtoy</field>
     <field name="user_id_i">192123</field>
     <field name="description_text">a solr user</field>
   </doc>
   <doc>
     <field name="class_name">BlogPost</field>
     <field name="user_id_i">192123</field>
     <field name="body_text">this is the description</field>
     <field name="title_text">this is a cool title</field>
   </doc>
 </add>
 <?xml version="1.0" encoding="UTF-8"?><commit/>


 Is it possible to do this with the join functionality? If not, how would I
 do this?

 I'd appreciate any pointers or help on this.


 Jason





-- 
- sent from my mobile
6176064373


query for point in time

2011-09-15 Thread gary tam
Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario: I have an employee record with multi-valued fields for
project, start date, and end date.

It looks something like:


John Smith   web site bug fix   2010-01-01   2010-01-03
             unit testing       2010-01-04   2010-01-06
             QA support         2010-01-07   2010-01-12
             implementation     2010-01-13   2010-01-22

I want to find what project John Smith was working on as of 2010-01-05.

Is this possible, or do I have to go back to my database?


Thanks


Re: query for point in time

2011-09-15 Thread Jonathan Rochkind
You didn't tell us what your schema looks like, what fields with what 
types are involved.


But similar to how you'd do it in your database, you need to find 
'documents' that have a start date before your date in question, and an 
end date after your date in question, to find the ones whose range 
includes your date in question.


Something like this:

q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

Of course, you need to add on your restriction to just documents about 
'John Smith', through another AND clause or an 'fq'.
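
A sketch of that full request using Solr's date syntax (the employee_name field
and the exact schema are assumptions):

  q=employee_name:"John Smith"
  fq=start_date:[* TO 2010-01-05T23:59:59Z]
  fq=end_date:[2010-01-05T00:00:00Z TO *]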


But in general, if you've got a db with this info already, and this is 
all you need, why not just use the db?  Multi-hierarchy data like this 
is going to give you trouble in Solr eventually, you've got to arrange 
the solr indexes/schema to answer your questions, and eventually you're 
going to have two questions which require mutually incompatible schema 
to answer.


An rdbms is a great general purpose question answering tool for 
structured data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:

Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for project,
started date, end date.

looks something like


John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
2010-01-06
  QA support 2010-01-07
2010-01-12
  implementation   2010-01-13
  2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks



Re: Sorting on multiValued fields via function query

2011-09-15 Thread boneill42


Was there a solution here?  Is there a ticket related to the sort=max(FIELD)
solution?

-brian

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p3340145.html
Sent from the Solr - User mailing list archive at Nabble.com.


[DIH] How to use combine Regex and HTML transformers

2011-09-15 Thread Pulkit Singhal
Hello,

I need to pull out the price and imageURL for products in an Amazon RSS feed.

PROBLEM STATEMENT:
The following:
<field column="description"
       xpath="/rss/channel/item/description"
       />
<field column="price"
       regex=".*?\$(\d*.\d*)"
       sourceColName="description"
       />
<field column="imageUrl"
       regex=".*?img src=&quot;(.*?)&quot;.*"
       sourceColName="description"
       />
works but I am left with html junk inside the description!

USELESS WORKAROUND:
If I try to strip the html from the data being fed into description
while letting the price and imageURL know of the direct path of the
RSS feed field like so:
<field column="description"
       xpath="/rss/channel/item/description"
       stripHTML="true"
       />
<field column="price"
       regex=".*?\$(\d*.\d*)"
       xpath="/rss/channel/item/description"
       />
<field column="imageUrl"
       regex=".*?img src=&quot;(.*?)&quot;.*"
       xpath="/rss/channel/item/description"
       />
then this fails and only the last configured field in this list
(imageURL) ends up having any data imported.
Is this a bug?

CRUX OF THE PROBLEM:
Also I tried to then create a field just to store the raw html data
like so but this configuration yields no results for the description
field so I'm back to where I started:
<field column="rawDescription"
       xpath="/rss/channel/item/description"
       />
<field column="description"
       regex=".*"
       sourceColName="rawDescription"
       stripHTML="true"
       />
<field column="price"
       regex=".*?\$(\d*.\d*)"
       sourceColName="rawDescription"
       />
<field column="imageUrl"
       regex=".*?img src=&quot;(.*?)&quot;.*"
       sourceColName="rawDescription"
       />
I was suspicious of trying to combine sourceColName with stripHTML to
begin with... I was hoping that the regex transformer would run first and
copy all the HTML data as-is, and that the HTML transformer would then
strip it out later, but this didn't work. Why? What else can I do?
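
A possible variant to test (purely a guess, not verified against this feed):
RegexTransformer appears to populate a single target column from capturing group 1,
so a bare regex=".*" may simply produce nothing for the copy. Giving the copy a
capturing group, with (?s) so the dot spans newlines, and keeping RegexTransformer
ahead of HTMLStripTransformer in the transformer list, would look like:

<field column="rawDescription"
       xpath="/rss/channel/item/description"
       />
<field column="description"
       regex="(?s)(.*)"
       sourceColName="rawDescription"
       stripHTML="true"
       />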

Thanks!
- Pulkit


Re: Can index size increase when no updates/optimizes are happening?

2011-09-15 Thread Yury Kats
On 9/14/2011 2:36 PM, Erick Erickson wrote:
 What is the machine used for? Was your user looking at
 a master? Slave? Something used for both?

Stand-alone machine with multiple Solr cores. No replication.

 Measuring the size of all the files in the index? Or looking
 at memory?

Disk space.

 The index files shouldn't be getting bigger unless there
 were indexing operations going on. 

That's what I thought.

 Is it at all possible that
 DIH was configured to run automatically (or any other
 indexing job for that matter) and your user didn't realize it?

There's no DIH, but there is a custom app that submit docs
for indexing via SolrJ. Supposedly, Solr logs were not showing
any updates over night, so the assumption was that no new docs
were added.

I'd write it off as a user error, but wanted to double check with
the community that no other internal Solr/Lucene task can change the index
file size in the absence of submits.


Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Hello Everyone,

I have a goal of populating Solr with a million unique products in
order to create a test environment for a proof of concept. I started
out by using DIH with Amazon RSS feeds but I've quickly realized that
there's no way I can glean a million products from one RSS feed. And
I'd go mad if I just sat at my computer all day looking for feeds and
punching them into DIH config for Solr.

Has anyone ever had to create large mock/dummy datasets for test
environments or for POCs/Demos to convince folks that Solr was the
wave of the future? Any tips would be greatly appreciated. I suppose
it sounds a lot like crawling even though it started out as innocent
DIH usage.

- Pulkit


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Daniel Skiles
I've done it using SolrJ and a *lot* of parallel processes feeding dummy
data into the server.
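
As a rough illustration of that approach (the URL, field names, counts and batch size
below are assumptions, not anything from this thread), a SolrJ loader can be as small
as:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MockProductLoader {
  public static void main(String[] args) throws Exception {
    // point at the core that should receive the dummy products
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (int i = 0; i < 1000000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", UUID.randomUUID().toString());
      doc.addField("name", "Mock product " + i);
      doc.addField("price", Math.round(Math.random() * 10000) / 100.0);
      doc.addField("category", "category" + (i % 50));
      batch.add(doc);
      if (batch.size() == 1000) {   // send in batches to keep memory bounded
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();
  }
}

Run several copies of it in parallel against the same core and a million documents
arrives fairly quickly.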

On Thu, Sep 15, 2011 at 4:54 PM, Pulkit Singhal pulkitsing...@gmail.comwrote:

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit



RE: Replication and ExternalFileField

2011-09-15 Thread Jaeger, Jay - DOT
Actually, Windoze also has symbolic links.  You have to manipulate them from 
the command line, but they do exist.

http://en.wikipedia.org/wiki/NTFS_symbolic_link
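
For example (the paths are invented), on Vista / Server 2008 and later a directory
symlink can be created with:

mklink /D C:\solr\data\external D:\shared\external-files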



-Original Message-
From: Per Osbeck [mailto:per.osb...@lbi.com] 
Sent: Thursday, September 15, 2011 7:15 AM
To: solr-user@lucene.apache.org
Subject: RE: Replication and ExternalFileField

Probably would have worked on *nix but unfortunately running Windows.

Best regards,
Per


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: den 15 september 2011 14:07
To: solr-user@lucene.apache.org
Subject: Re: Replication and ExternalFileField

Perhaps a symlink will do the trick.

On Thursday 15 September 2011 14:04:47 Per Osbeck wrote:
 Hi all,
 
 I'm trying to find some good information regarding replication, 
 especially for the ExternalFileField.
 
 As I understand it;
  - the external files must be in data dir.
  - replication only replicates data/indexes and possibly confFiles 
 from the conf dir.
 
 Does anyone have suggestions or ideas on how this should would work?
 
 Best regards,
 Per Osbeck

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Markus Jelsma
If we want to test with huge amounts of data we feed portions of the internet. 
The problem is it takes a lot of bandwidth and lots of computing power to get 
to a `reasonable` size. On the positive side, you deal with real text so it's 
easier to tune for relevance.

I think it's easier to create a simple XML generator with mock data, prices, 
popularity rates etc. It's fast to generate millions of mock products and once 
you have a large quantity of XML files, you can easily index, test, change 
config or schema and reindex.
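
Such a generator only has to emit ordinary Solr update XML; a generated file might
contain docs like the following (field names and values are invented for
illustration):

<add>
  <doc>
    <field name="id">SKU-0000001</field>
    <field name="name">Mock product 1</field>
    <field name="price">19.99</field>
    <field name="popularity">42</field>
  </doc>
  <!-- ... one doc per mock product ... -->
</add>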

On the other hand, the sample data that comes with the Solr example is a good 
set as well as it proves the concepts well, especially with the stock Velocity 
templates.

We know Solr will handle enormous sets but quantity is not always a part of a 
PoC.

 Hello Everyone,
 
 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.
 
 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.
 
 - Pulkit


Re: query for point in time

2011-09-15 Thread gary tam
Thanks for the reply.  We had the search within the database initially, but
it proved to be too slow.  With solr we have much better performance.

One more question: how could I find the most current job for each employee?

My data looks like


John Smith   department A   web site bug fix   2010-01-01   2010-01-03
                            unit testing       2010-01-04   2010-01-06
                            QA support         2010-01-07   2010-01-12
                            implementation     2010-01-13   2010-01-22

Jane Doe     department A   QA support         2010-01-01   2010-05-01
                            implementation     2010-05-02   2010-09-28

Joe Doe      department A   PHP development    2011-01-01   2011-08-31
                            Java Development   2011-09-01   2011-09-15

I would like to return this as my search result

John Smith   department A   implementation     2010-01-13   2010-01-22
Jane Doe     department A   implementation     2010-05-02   2010-09-28
Joe Doe      department A   Java Development   2011-09-01   2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

 You didn't tell us what your schema looks like, what fields with what types
 are involved.

 But similar to how you'd do it in your database, you need to find
 'documents' that have a start date before your date in question, and an end
 date after your date in question, to find the ones whose range includes your
 date in question.

 Something like this:

 q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

 Of course, you need to add on your restriction to just documents about
 'John Smith', through another AND clause or an 'fq'.

 But in general, if you've got a db with this info already, and this is all
 you need, why not just use the db?  Multi-hieararchy data like this is going
 to give you trouble in Solr eventually, you've got to arrange the solr
 indexes/schema to answer your questions, and eventually you're going to have
 two questions which require mutually incompatible schema to answer.

 An rdbms is a great general purpose question answering tool for structured
 data.  lucene/Solr is a great indexing tool for text matching.


 On 9/15/2011 2:55 PM, gary tam wrote:

 Hi

 I have a scenario that I am not sure how to write the query for.

 Here is the scenario - have an employee record with multi value for
 project,
 started date, end date.

 looks something like


 John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
 2010-01-06
  QA support 2010-01-07
 2010-01-12
  implementation   2010-01-13
  2010-01-22

 I want to find what project John Smith was working on 2010-01-05

 Is this possible or I have to back to my database ?


 Thanks




Re: query for point in time

2011-09-15 Thread Jonathan Rochkind

I think there's something wrong with your database then, but okay.

You still haven't said what your Solr schema looks like -- that list of 
values doesn't say what the solr field names or types are. I think this 
is maybe because you don't actually have a Solr database and have no 
idea how Solr works, you're just asking in theory? On the other hand, 
you just said you have better performance with solr -- I'm not sure how 
you were able to tell the performance of solr in answering these queries 
if you don't even know how to make them!


But, again, assuming your data is set up like i'm guessing it is, it's 
quite similar to what you'd do with an rdbms.


What does 'most current' mean? Can jobs be overlapping? To find the 
project with the latest start date for a given person, just limit to 
documents with that current person in a 'q' or 'fq', and then sort by 
start_date desc. Perhaps limit to 1 if you really only want one hit.  
Same principle as you would in an rdbms.
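
Concretely (the field names here are only guesses based on the sample data), that
would be something along the lines of:

q=person_name:"John Smith"
sort=start_date desc
rows=1

and the single hit returned is that person's most recently started project.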


Again, this requires setting up your solr index in such a way to answer 
these sorts of questions. Each document in Solr will represent a 
person-project pair.  It'll have fields for person (or multiple fields, 
personID, personFirst, personLast, etc), project name, project start 
date, project end date.  This will make it easy/possible to answer 
questions like your examples with Solr, but will make it hard to answer 
many other sorts of questions -- unlike an rdbms, it is difficult to set 
up a Solr index that can flexibly answer just about any question you 
throw at it, particularly when you have hierarchical or otherwise 
multi-entity data.


If you are interested, the standard Solr tutorial is pretty good: 
http://lucene.apache.org/solr/tutorial.html





On 9/15/2011 6:39 PM, gary tam wrote:

Thanks for the reply.  We had the search within the database initially, but
it proven to be too slow.  With solr we have much better performance.

One more question, how could I find the most current job for each employee

My data looks like


John Smith  department A   web site bug fix   2010-01-01
2010-01-03
  unit testing
  2010-01-04   2010-01-06
  QA support
2010-01-07   2010-01-12
  implementation   2010-01-13
2010-01-22

Jane Doe  department A  QA support 2010-01-01
2010-05-01
  implementation   2010-05-02
2010-09-28

Joe Doe  department APHP development  2011-01-01
2011-08-31
  Java Development  2011-09-01
 2011-09-15

I would like to return this as my search result

John Smith   department Aimplementation  2010-01-13
   2010-01-22
Jane Doe  department Aimplementation  2010-05-02
   2010-09-28
Joe Doedepartment AJava Development  2011-09-01
   2011-09-15


Thanks in advance
Gary



On Thu, Sep 15, 2011 at 3:33 PM, Jonathan Rochkindrochk...@jhu.edu  wrote:


You didn't tell us what your schema looks like, what fields with what types
are involved.

But similar to how you'd do it in your database, you need to find
'documents' that have a start date before your date in question, and an end
date after your date in question, to find the ones whose range includes your
date in question.

Something like this:

q=start_date:[* TO '2010-01-05'] AND end_date:['2010-01-05' TO *]

Of course, you need to add on your restriction to just documents about
'John Smith', through another AND clause or an 'fq'.

But in general, if you've got a db with this info already, and this is all
you need, why not just use the db?  Multi-hieararchy data like this is going
to give you trouble in Solr eventually, you've got to arrange the solr
indexes/schema to answer your questions, and eventually you're going to have
two questions which require mutually incompatible schema to answer.

An rdbms is a great general purpose question answering tool for structured
data.  lucene/Solr is a great indexing tool for text matching.


On 9/15/2011 2:55 PM, gary tam wrote:


Hi

I have a scenario that I am not sure how to write the query for.

Here is the scenario - have an employee record with multi value for
project,
started date, end date.

looks something like


John Smith web site bug fix   2010-01-01   2010-01-03
  unit testing  2010-01-04
2010-01-06
  QA support 2010-01-07
2010-01-12
  implementation   2010-01-13
  2010-01-22

I want to find what project John Smith was working on 2010-01-05

Is this possible or I have to back to my database ?


Thanks




Lucene-SOLR transition

2011-09-15 Thread Scott Smith
I've been using lucene for a number of years.  We've now decided to move to 
SOLR.  I have a couple of questions.


1.   I'm used to creating Boolean queries, filter queries, term queries, 
etc. for lucene.  Am I right in thinking that for SOLR my only option is 
creating string queries (with q and fq components) for solrj?

2.   Assuming that the answer to 1 is correct, then is there an easy way 
to take a lucene query (with nested Boolean queries, filter queries, etc.) and 
generate a SOLR query string with q and fq components?

Thanks

Scott


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Ah missing } doh!

BTW I still welcome any ideas on how to build an e-commerce test base.
It doesn't have to be Amazon; that was just my approach. Anyone?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com wrote:
 Thanks for all the feedback thus far. Now to get  little technical about it :)

 I was thinking of feeding a file with all the tags of amazon that
 yield close to roughly 5 results each into a file and then running
 my rss DIH off of that, I came up with the following config but
 something is amiss, can someone please point out what is off about
 this?

    document
        entity name=amazonFeeds
                processor=LineEntityProcessor
                url=file:///xxx/yyy/zzz/amazonfeeds.txt
                rootEntity=false
                dataSource=myURIreader1
                transformer=RegexTransformer,DateFormatTransformer
                
            entity name=feed
                    pk=link
                    url=${amazonFeeds.rawLine
                    processor=XPathEntityProcessor
                    forEach=/rss/channel | /rss/channel/item

 transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow
 ...

 The rawline should feed into the url key but instead i get:

 Caused by: java.net.MalformedURLException: no protocol:
 null${amazonFeeds.rawLine
        at 
 org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

 Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
 INFO: start rollback

 Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
 SEVERE: Exception while solr rollback.

 Thanks in advance!

 On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 If we want to test with huge amounts of data we feed portions of the 
 internet.
 The problem is it takes a lot of bandwith and lots of computing power to get
 to a `reasonable` size. On the positive side, you deal with real text so it's
 easier to tune for relevance.

 I think it's easier to create a simple XML generator with mock data, prices,
 popularity rates etc. It's fast to generate millions of mock products and 
 once
 you have a large quantity of XML files, you can easily index, test, change
 config or schema and reindex.

 On the other hand, the sample data that comes with the Solr example is a good
 set as well as it proves the concepts well, especially with the stock 
 Velocity
 templates.

 We know Solr will handle enormous sets but quantity is not always a part of a
 PoC.

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit




Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of feeding a file with all the tags of amazon that
yield close to roughly 5 results each into a file and then running
my rss DIH off of that, I came up with the following config but
something is amiss, can someone please point out what is off about
this?

document
entity name=amazonFeeds
processor=LineEntityProcessor
url=file:///xxx/yyy/zzz/amazonfeeds.txt
rootEntity=false
dataSource=myURIreader1
transformer=RegexTransformer,DateFormatTransformer

entity name=feed
pk=link
url=${amazonFeeds.rawLine
processor=XPathEntityProcessor
forEach=/rss/channel | /rss/channel/item

transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow
...

The rawLine should feed into the url key, but instead I get:

Caused by: java.net.MalformedURLException: no protocol:
null${amazonFeeds.rawLine
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
markus.jel...@openindex.io wrote:
 If we want to test with huge amounts of data we feed portions of the internet.
 The problem is it takes a lot of bandwith and lots of computing power to get
 to a `reasonable` size. On the positive side, you deal with real text so it's
 easier to tune for relevance.

 I think it's easier to create a simple XML generator with mock data, prices,
 popularity rates etc. It's fast to generate millions of mock products and once
 you have a large quantity of XML files, you can easily index, test, change
 config or schema and reindex.

 On the other hand, the sample data that comes with the Solr example is a good
 set as well as it proves the concepts well, especially with the stock Velocity
 templates.

 We know Solr will handle enormous sets but quantity is not always a part of a
 PoC.

 Hello Everyone,

 I have a goal of populating Solr with a million unique products in
 order to create a test environment for a proof of concept. I started
 out by using DIH with Amazon RSS feeds but I've quickly realized that
 there's no way I can glean a million products from one RSS feed. And
 I'd go mad if I just sat at my computer all day looking for feeds and
 punching them into DIH config for Solr.

 Has anyone ever had to create large mock/dummy datasets for test
 environments or for POCs/Demos to convince folks that Solr was the
 wave of the future? Any tips would be greatly appreciated. I suppose
 it sounds a lot like crawling even though it started out as innocent
 DIH usage.

 - Pulkit



ClassCastException: SmartChineseWordTokenFilterFactory to TokenizerFactory

2011-09-15 Thread Xue-Feng Yang
Hi all,

I am trying to use SmartChineseWordTokenFilterFactory in solr 3.4.0, but I run into 
the error

SEVERE: java.lang.ClassCastException: 
org.apache.solr.analysis.SmartChineseWordTokenFilterFactory cannot be cast to 
org.apache.solr.analysis.TokenizerFactory


Any thought?

Re: hi. allowLeadingWildcard is it possible or not yet?

2011-09-15 Thread deniz
I wonder the same thing... so I want to revive this topic.

Is it possible?

-
Zeki ama calismiyor... Calissa yapar...
--
View this message in context: 
http://lucene.472066.n3.nabble.com/hi-allowLeadingWildcard-is-it-possible-or-not-yet-tp495457p3340838.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Lance Norskog
http://aws.amazon.com/datasets

DBPedia might be the easiest to work with:
http://aws.amazon.com/datasets/2319

Amazon has a lot of these things.
Infochimps.com is a marketplace for free  pay versions.


Lance

On Thu, Sep 15, 2011 at 6:55 PM, Pulkit Singhal pulkitsing...@gmail.comwrote:

 Ah missing } doh!

 BTW I still welcome any ideas on how to build an e-commerce test base.
 It doesn't have to be amazon that was jsut my approach, any one?

 - Pulkit

 On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal pulkitsing...@gmail.com
 wrote:
  Thanks for all the feedback thus far. Now to get  little technical about
 it :)
 
  I was thinking of feeding a file with all the tags of amazon that
  yield close to roughly 5 results each into a file and then running
  my rss DIH off of that, I came up with the following config but
  something is amiss, can someone please point out what is off about
  this?
 
 document
 entity name=amazonFeeds
 processor=LineEntityProcessor
 url=file:///xxx/yyy/zzz/amazonfeeds.txt
 rootEntity=false
 dataSource=myURIreader1
 transformer=RegexTransformer,DateFormatTransformer
 
 entity name=feed
 pk=link
 url=${amazonFeeds.rawLine
 processor=XPathEntityProcessor
 forEach=/rss/channel | /rss/channel/item
 
 
 transformer=RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow
  ...
 
  The rawline should feed into the url key but instead i get:
 
  Caused by: java.net.MalformedURLException: no protocol:
  null${amazonFeeds.rawLine
 at
 org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
 
  Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2
 rollback
  INFO: start rollback
 
  Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter
 rollback
  SEVERE: Exception while solr rollback.
 
  Thanks in advance!
 
  On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
  markus.jel...@openindex.io wrote:
  If we want to test with huge amounts of data we feed portions of the
 internet.
  The problem is it takes a lot of bandwith and lots of computing power to
 get
  to a `reasonable` size. On the positive side, you deal with real text so
 it's
  easier to tune for relevance.
 
  I think it's easier to create a simple XML generator with mock data,
 prices,
  popularity rates etc. It's fast to generate millions of mock products
 and once
  you have a large quantity of XML files, you can easily index, test,
 change
  config or schema and reindex.
 
  On the other hand, the sample data that comes with the Solr example is a
 good
  set as well as it proves the concepts well, especially with the stock
 Velocity
  templates.
 
  We know Solr will handle enormous sets but quantity is not always a part
 of a
  PoC.
 
  Hello Everyone,
 
  I have a goal of populating Solr with a million unique products in
  order to create a test environment for a proof of concept. I started
  out by using DIH with Amazon RSS feeds but I've quickly realized that
  there's no way I can glean a million products from one RSS feed. And
  I'd go mad if I just sat at my computer all day looking for feeds and
  punching them into DIH config for Solr.
 
  Has anyone ever had to create large mock/dummy datasets for test
  environments or for POCs/Demos to convince folks that Solr was the
  wave of the future? Any tips would be greatly appreciated. I suppose
  it sounds a lot like crawling even though it started out as innocent
  DIH usage.
 
  - Pulkit
 
 




-- 
Lance Norskog
goks...@gmail.com


Re: ClassCastException: SmartChineseWordTokenFilterFactory to TokenizerFactory

2011-09-15 Thread Lance Norskog
Tokenizers and TokenFilters are different. Look in the schema for how other
TokenFilterFactory classes are used.
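
In other words, the factory that failed to cast is a token filter, not a tokenizer; a
fieldType along these lines (an untested sketch, class names from the 3.x
analysis-extras contrib) pairs it with its matching sentence tokenizer:

<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="org.apache.solr.analysis.SmartChineseSentenceTokenizerFactory"/>
    <filter class="org.apache.solr.analysis.SmartChineseWordTokenFilterFactory"/>
  </analyzer>
</fieldType>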

On Thu, Sep 15, 2011 at 8:05 PM, Xue-Feng Yang just4l...@yahoo.com wrote:

 Hi all,

 I am trying to use SmartChineseWordTokenFilterFactory in solr 3.4.0, but
 come to the error

 SEVERE: java.lang.ClassCastException:
 org.apache.solr.analysis.SmartChineseWordTokenFilterFactory cannot be cast
 to org.apache.solr.analysis.TokenizerFactory


 Any thought?




-- 
Lance Norskog
goks...@gmail.com