Re: Term searches with colon(:)

2012-09-07 Thread Chris Hostetter

: I was wondering if anybody has run into this issue before. Solr is not
: returning any search results for words that contain a colon ( : ) in them
: when we perform a term search containing a colon.  We do escape this
: correctly, I believe, as shown in the sample (taken from tomcat logs)
...
: INFO: [] webapp=/X path=/select
: params={q=+(*\:*)+&rows=100&version=2.2} hits=0 status=0 QTime=0 

what are you expecting that query to match?  because by backslash escaping 
the colon, what you are asking for there is for Solr to search for the 
literal string "*:*" in your default search field (after whatever query 
time analysis is configured on your default search field).

: On the other hand if we prefix the term with a field name, we get
: correct results as shown below
...
: INFO: [] webapp=/X path=/select
: params={q=(+description:*\:*)&rows=100&version=2.2} hits=1 status=0
: QTime=0 

perhaps the problem is that you are expecting "description" to be your 
default search field but it is not configured that way?
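
if so, either name the field in the query or point the default field at it 
at request time.  a sketch (field name and term purely illustrative; the df 
param is supported on recent Solr versions):

    q=description:foo\:bar
    q=foo\:bar&df=description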


-Hoss


Term searches with colon(:)

2012-09-07 Thread Nemani, Raj
All,

I was wondering if anybody has run into this issue before. Solr is not
returning any search results for words that contain a colon ( : ) in them
when we perform a term search containing a colon.  We do escape this
correctly, I believe, as shown in the sample (taken from tomcat logs)


Sep 06, 2012 11:30:01 PM org.apache.solr.core.SolrCore execute

INFO: [] webapp=/X path=/select
params={q=+(*\:*)+&rows=100&version=2.2} hits=0 status=0 QTime=0 


On the other hand if we prefix the term with a field name, we get
correct results as shown below


Sep 06, 2012 11:30:29 PM org.apache.solr.core.SolrCore execute

INFO: [] webapp=/X path=/select
params={q=(+description:*\:*)&rows=100&version=2.2} hits=1 status=0
QTime=0 


Did anybody encounter this behavior before?


Any help is appreciated, and thank you in advance.


Raj


Re: N-gram ranking based on term position

2012-09-07 Thread Kiran Jayakumar
Since Edge N-gram tokens are a subset of N-gram tokens, I was wondering if
I could be a bit more space efficient.

On Fri, Sep 7, 2012 at 3:07 PM, Amit Nithian  wrote:

> I think your thought about using the edge ngram as a field and
> boosting that field in the qf/pf sections of the dismax handler sounds
> reasonable. Why do you have qualms about it?
>
> On Fri, Sep 7, 2012 at 12:28 PM, Kiran Jayakumar 
> wrote:
> > Hi,
> >
> > Is it possible to score documents with a match "early in the text" higher
> > than "later in the text" ? I want to boost "begin with" matches higher
> than
> > the "contains" matches. I can define a copy field and analyze it as edge
> > n-gram and boost it. I was wondering if there was a better way to do it.
> >
> > Thanks
>


Re: N-gram ranking based on term position

2012-09-07 Thread Amit Nithian
I think your thought about using the edge ngram as a field and
boosting that field in the qf/pf sections of the dismax handler sounds
reasonable. Why do you have qualms about it?

On Fri, Sep 7, 2012 at 12:28 PM, Kiran Jayakumar  wrote:
> Hi,
>
> Is it possible to score documents with a match "early in the text" higher
> than "later in the text" ? I want to boost "begin with" matches higher than
> the "contains" matches. I can define a copy field and analyze it as edge
> n-gram and boost it. I was wondering if there was a better way to do it.
>
> Thanks
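
For reference, a minimal sketch of the copyField-plus-edge-n-gram approach
being discussed here (field names and gram sizes are illustrative):

    <!-- schema.xml -->
    <fieldType name="text_edge" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="title_edge" type="text_edge" indexed="true" stored="false"/>
    <copyField source="title" dest="title_edge"/>

The new field then gets a higher weight in the dismax parameters, e.g.
qf=title^1.0 title_edge^5.0, so "begins with" token matches score higher.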


RE: [Solr4 beta] error 503 on commit

2012-09-07 Thread Markus Jelsma
Hi,

We've seen this too on one of the test nodes yesterday; it was running a build a 
few days old. The node receiving documents complained it could not forward them 
to the fifth node and returned a 503. The fifth node itself only logged an NPE 
and the 503, nothing more, no stack traces.

There was no heavy committing going on (we only use auto commit of a few 
seconds) and there were only about 500 documents. The first two batches were 
accepted but then it died. It failed very consistently. Perhaps it accepted the 
first batches because no document had been sent to the shard leader on that node yet.

Restarting the node fixed the trouble.


-Original message-
> From:Chris Hostetter 
> Sent: Fri 07-Sep-2012 23:02
> To: solr-user@lucene.apache.org
> Subject: Re: [Solr4 beta] error 503 on commit
> 
> 
> : I get sometimes (not often):
> : SolrException e  where   e.code() ==
> : SolrException.ErrorCode.SERVICE_UNAVAILABLE.code
> 
> Are there any errors in your solr server logs?
> Are you using the DistributedUpdateProcessor (ie: SolrCloud) ?
> 
> There aren't many places in Solr that will throw a 503 status code, if i 
> had to guess i would suspect that your problem is you are committing too 
> often relative to the amount of warming you are doing, and exceeding the 
> max number of open searchers...
> 
> https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
> 
> : When I catch this exception, I try to commit again, the call doesn't throw,
> : but the docs are not committed. Am I supposed to add docs again before
> 
> Hmmm... that is odd, i think we'd definitely need to see the logs from one 
> of these errors (and the lines leading up to it) to understand what might 
> be happening.
> 
> -Hoss
> 


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-07 Thread Shawn Heisey

On 9/6/2012 6:54 PM, kiran chitturi wrote:

The error i am getting is 'org.apache.solr.common.SolrException: Invalid
Date String: '1345743552'.

  I think it was being saved as a string in DB, so i will use the
DateFormatTransformer.


To go along with all the other replies that you have gotten:  I import 
from MySQL with a unix format date field.  It's a bigint, not a string, 
but a quick test on MySQL 5.1 shows that the function works with strings 
too.  This is how my SELECT handles that field - I have MySQL convert it 
before it gets to Solr:


from_unixtime(`d`.`post_date`) AS `pd`

When it comes to the character set issues, this is how I have defined 
the driver in the dataimport config.  The character set in the database 
is utf8.


  
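
A JdbcDataSource definition along those lines might look like the following
(attribute values illustrative; useUnicode and characterEncoding are standard
MySQL Connector/J properties):

    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
        url="jdbc:mysql://dbhost/dbname?useUnicode=true&amp;characterEncoding=UTF-8"
        user="..." password="..."/>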

Thanks,
Shawn



Re: Why is using edismax in Admin UI puts edismax=true but not defType=edismax?

2012-09-07 Thread Chris Hostetter

: I am not sure edismax=true as a flag actually does anything (Solr4 beta):

Alexandre: You are 100% correct, this appears to be a bug in the Admin 
UI.  Thank you for reporting it...

https://issues.apache.org/jira/browse/SOLR-3811


-Hoss


Why is using edismax in Admin UI puts edismax=true but not defType=edismax?

2012-09-07 Thread Alexandre Rafalovitch
Hello,

I am not sure edismax=true as a flag actually does anything (Solr4 beta):

'responseHeader'=>{
'status'=>0,
'QTime'=>1,
'params'=>{
  'debugQuery'=>'true',
  'indent'=>'true',
  'edismax'=>'true',
  'q'=>'text',
  'qf'=>'TitleEN DescEN',
  'wt'=>'ruby',
  'rows'=>'0'}},
  'response'=>{'numFound'=>34,'start'=>0,'docs'=>[]
  },
  'debug'=>{
'rawquerystring'=>'text',
'querystring'=>'text',
'parsedquery'=>'TitleEN:text',
'parsedquery_toString'=>'TitleEN:text',
'explain'=>{},
'QParser'=>'LuceneQParser',
'timing'=>{


When I do defType=edismax, I get instead:

'QParser'=>'ExtendedDismaxQParser',
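
(i.e. a request along these lines, host and core illustrative:
/select?q=text&defType=edismax&qf=TitleEN+DescEN&debugQuery=true&wt=ruby)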


Is my setup not configured properly for the flag to work?

Regards,
  Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


Re: [Solr4 beta] error 503 on commit

2012-09-07 Thread Chris Hostetter

: I get sometimes (not often):
: SolrException e  where   e.code() ==
: SolrException.ErrorCode.SERVICE_UNAVAILABLE.code

Are there any errors in your solr server logs?
Are you using the DistributedUpdateProcessor (ie: SolrCloud) ?

There aren't many places in Solr that will throw a 503 status code, if i 
had to guess i would suspect that your problem is you are committing too 
often relative to the amount of warming you are doing, and exceeding the 
max number of open searchers...

https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
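
if that is the cause, the relevant solrconfig.xml knobs look roughly like
this (values illustrative):

    <maxWarmingSearchers>2</maxWarmingSearchers>
    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>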

: When I catch this exception, I try to commit again, the call doesn't throw,
: but the docs are not committed. Am I supposed to add docs again before

Hmmm... that is odd, i think we'd definitely need to see the logs from one 
of these errors (and the lines leading up to it) to understand what might 
be happening.

-Hoss


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

>>if not, just customize blocktree's params with a CodecFactory in solr,
>>or even pick another implementation (FixedGap, VariableGap, whatever).

Still trying to get my head around 4.0 and flexible indexing.  I'll take
another look at Mike's and your presentations.  I'm trying to figure out
how to get from the Lucene JavaDocs you pointed out to how to specify
things in Solr and its config files.

Is there an example CodecFactory somewhere I could look at?  Also, is
there an example somewhere of how to specify a CodecFactory/Codec in
Solr using the schema.xml or solrconfig.xml?

Is there some simple way to specify minBlockSize and maxBlockSize in
schema.xml?

Once I get this all working and understand it, I'll be happy to draft some
documentation.

I'm really looking forward to experimenting with 4.0!

Tom



On Fri, Sep 7, 2012 at 2:58 PM, Robert Muir  wrote:

> On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West 
> wrote:
> > Thanks Robert,
> >
> > I'll have to spend some time understanding the default codec for Solr
> 4.0.
> > Did I miss something in the changes file?
>
> http://lucene.apache.org/core/4_0_0-BETA/
>
> see the file formats section, especially
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary
>
> (since blocktree "covers" term dictionary and terms index)
>
> >
> >  I'll be digging into the default codec docs and testing sometime in the next
> > week or two (with a 2 billion term index).  If I understand it well
> enough,
> > I'll be happy to draft some changes up for either the wiki or the Solr
> > example solrconfig.xml file.
>
> right i think we should remove these parameters.
>
> >
> > Does this mean that the default codec will reduce memory use for the
> terms
> > index enough so I don't need to use either of these settings to deal with
> > my > 2 billion term indexes?
>
> probably. i don't know enough about your terms or how much RAM you have
> to say for sure.
>
> if not, just customize blocktree's params with a CodecFactory in solr,
> or even pick another implementation (FixedGap, VariableGap, whatever).
>
> the interval/divisor stuff is mostly only useful if you are not
> reindexing from scratch: e.g. if you are gonna plop your 3.x index
> into 4.x then you should set
> those to whatever you were using before, since it will be using
> PreflexCodec to read those.
>
> --
> lucidworks.com
>


Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0

2012-09-07 Thread Chris Hostetter

: > When i index a text field which has arabic and English like this tweet
: > “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟”
: > #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
: > with field_type as 'text_ar' and when i try to see the same field again in
: > solr, it is shown as below.
: > RT @AhmedWagih: لو معملناش حاجة Ù???ÙŠ الزيادة
: > السكانية Ù???ÙŠ مصر، هنتحول لدولة Ù???قيرة
: > كثيÙ???Ø© السكان زي بنجلادش #Egypt #EgyEconomy

: The encoding of your input text is being mangled at some point.
: Presuming that your original encoding is UTF-8, I would look at
: how you are indexing into Solr, and the encoding settings on the
: Java container. Solr itself handles UTF-8 perfectly fine, as do
: most Java containers if configured properly, so my first suspicion
: would be the indexing code.

right -- the key thing is to narrow down whether the charset of your data 
is getting mangled between the db -> solr or between solr -> your eyes

I would suggest you start by looking at some of the sample documents that 
come with solr which include non ASCII characters, and indexing those 
using the post.jar that is provided.  if those show up fine for you in 
solr, then your servlet container probably isn't doing the munging -- 
there is also a "test_utf8.sh" in the exampledocs directory that can help 
you verify if your servlet container is working properly.

If you rule that out, then the next step is to look at your database, and 
the way your JDBC driver (what DIH uses to talk to your database) is 
working.

Some databases have the concept of a "default charset" but then individual 
columns can override that with some other charset, and database-specific 
commandline tools might know about those (so your data looks 
fine when you run SQL statements directly) but external clients have no 
way of knowing unless specially configured.

For example: the MySQL jdbc driver has some special options you can 
use to force it to use unicode and to specify which charset to use 
when returning data...

https://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html
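
For instance (property names per the Connector/J docs; the & must be written
as &amp; when the URL is embedded in data-config.xml):

    jdbc:mysql://dbhost/dbname?useUnicode=true&characterEncoding=UTF-8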



-Hoss

Re: Solr search not working after copying a new field to an existing Indexed Field

2012-09-07 Thread Mani
Yes, I do have the uniqueKey defined properly:

<uniqueKey>id</uniqueKey>


Before the schema change...




After the schema change...











Re: Solr search not working after copying a new field to an existing Indexed Field

2012-09-07 Thread Kiran Jayakumar
Do you have the unique key set up in your schema.xml? It should be
automatic if you have the ID field and define it as the unique key:

<uniqueKey>ID</uniqueKey>

On Thu, Sep 6, 2012 at 11:50 AM, Mani  wrote:

> I have made a schema change to copy an existing field "name" (Source
> Field)
> to an existing search field "text" (Destination Field).
>
> Since I made the schema change, I updated all the documents thinking the
> new
> source field will be clubbed together with the "text" field.  The search
> for
> a specific name does not return results.
>
> If I delete the document and then adding the document back works just fine.
>
> I thought Add command with default override option will work as Delete and
> Add.
>
> Is this the only way to reindex the "text" field? Is there any other method?
>
> I really appreciate your help on this!
>
>
>
>
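
For reference, the schema change under discussion amounts to a rule of this
shape; copyField is applied at index time, which is why existing documents
have to be reindexed before the destination field reflects it:

    <copyField source="name" dest="text"/>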


Re: Problem with verifying signature ?

2012-09-07 Thread Kiran Jayakumar
Thank you.

On Thu, Sep 6, 2012 at 9:51 AM, Chris Hostetter wrote:

>
> : gpg: Signature made 08/06/12 19:52:21 Pacific Daylight Time using RSA key
> : ID 322D7ECA
> : gpg: Good signature from "Robert Muir (Code Signing Key) <
> rm...@apache.org>"
> : *gpg: WARNING: This key is not certified with a trusted signature!*
> : gpg:  There is no indication that the signature belongs to the
> : owner.
> : Primary key fingerprint: 6661 9BA3 C030 DD55 3625  1303 817A E1DD 322D
> 7ECA
> :
> : Is this acceptable ?
>
> I guess it depends on what you mean by acceptable?
>
> I'm not an expert on this, but as i understand it...
>
> gpg is telling you that it confirmed the signature matches a known key
> named "Robert Muir (Code Signing Key)" which is in your keyring, but that
> there is no certified level of trust association with that key.
>
> Key Trust is a personal thing, specific to you, your keyring, and how you
> got the keys you put in that ring.  if you trust that the KEYS file you
> downloaded from apache.org is legitimate, and that all the keys in it
> should be trusted, you can tell gpg that.  (using the "trust"
> interactive command when using --edit-key)
>
> Alternatively, you could tell gpg that you have a high level of trust in
> the key of some other person you have met personally -- ie: if you met Uwe
> at a conference and he physically handed you his key on a USB drive -- and
> then if Uwe has signed Robert's key with his own (i think it has, not sure
> off the top of my head), then gpg would extend an implicit transitive
> trust to Robert's key...
>
> http://www.gnupg.org/gph/en/manual.html#AEN335
>
>
> -Hoss
>
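
For reference, a minimal gpg sequence for the "trust" route described above
(key ID taken from the quoted output):

    gpg --edit-key 322D7ECA
    gpg> trust
    (pick a trust level, then exit with "save")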


Re: How to preserve source column names in multivalue catch all field

2012-09-07 Thread Kiran Jayakumar
Thank you Erick. I think #2 is the best for me because I have more than a
hundred fields and don't want to construct a huge query each time.

On Thu, Sep 6, 2012 at 9:38 PM, Erick Erickson wrote:

> Try using edismax to distribute the search across the fields rather
> than using the catch-all field. There's no way that I know of to
> reconstruct what field the source was.
>
> But storing the source fields without indexing them is OK too, it won't
> affect
> searching speed noticeably...
>
> Best
> Erick
>
> On Tue, Sep 4, 2012 at 11:52 AM, Kiran Jayakumar 
> wrote:
> > Hi everyone,
> >
> > I have got a multivalue catch all field which captures all the text
> fields.
> > Whats the best way to preserve the column information also ? In the UI, I
> > need to show  :  type output. Right now, I am storing the
> > source fields without indexing. Is there a better way to do it ?
> >
> > Thanks
>
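
For what it's worth, option #2 can be a single dynamic-field rule rather than
a hundred individual declarations (pattern and type illustrative):

    <dynamicField name="src_*" type="string" indexed="false" stored="true"/>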


[Solr4 beta] error 503 on commit

2012-09-07 Thread Antoine LE FLOC'H
Hello,

Using "package org.apache.solr.client.solrj;"

when I do:

UpdateResponse ur = solrServer.commit(false, false);

I get sometimes (not often):
SolrException e  where   e.code() ==
SolrException.ErrorCode.SERVICE_UNAVAILABLE.code

When I catch this exception, I try to commit again; the call doesn't throw,
but the docs are not committed. Am I supposed to add the docs again, i.e.
solrServer.add(docs), before committing?
What am I supposed to do, basically?

Please note that I was not getting this on the Solr4 trunk from december,
but I am getting it sometimes (not systematic) since the Solr4 alpha.

Thanks for your help.


Re: Re: Schema model to store additional field metadata

2012-09-07 Thread sysrq
> Why would you store the actual images in SOLR?

No, the images are files on the filesystem. Only the path to the image should 
be stored in Solr.

> And you are most likely looking at dynamic fields as the solution
> 
> 1) Define *_Path, *_Size, *_Alt as a dynamic field with appropriate types
> 2) During indexing, write those properties as Image_1_Path,
> Image_1_Size, Image_1_Alt or some such
> 3) Make sure that whatever search algorithm you have looks at those or
> do a copyField to aggregate them into AllImage_Alt, etc.

I was also thinking of a solution with dynamic fields, but I am very new to 
Solr and I am not sure if it is a good solution to solve this modelling issue. 
For example I thought about introducing two multiValued dynamic fields 
(image_src_*, image_alt_*) and store image data like file path on disc and 
alt-attribute like this:

title: An article about Foo and Bar
content:   This is some text about Foo and Bar.
published: 2012.09.07T19:23
image_src_1: 2012/09/foo.png
image_alt_1: Foo. Waiting for the bus.
image_src_2: 2012/04/images/bar.png
image_src_3: 2012/02/abc.png
image_alt_3: Foo and Bar at the beach

Of course an alt attribute could be missing for some images. I don't know if 
this is a good or even a better solution. It feels clumsy to me, like a 
workaround. But maybe this is the way to model this data, I don't know?
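
For what it's worth, the two dynamic fields sketched above would be declared
along these lines (field types illustrative):

    <dynamicField name="image_src_*" type="string" indexed="true" stored="true"/>
    <dynamicField name="image_alt_*" type="text_general" indexed="true" stored="true"/>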


Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Robert Muir
On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West  wrote:
> Thanks Robert,
>
> I'll have to spend some time understanding the default codec for Solr 4.0.
> Did I miss something in the changes file?

http://lucene.apache.org/core/4_0_0-BETA/

see the file formats section, especially
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/codecs/lucene40/Lucene40PostingsFormat.html#Termdictionary

(since blocktree "covers" term dictionary and terms index)

>
>  I'll be digging into the default codec docs and testing sometime in the next
> week or two (with a 2 billion term index).  If I understand it well enough,
> I'll be happy to draft some changes up for either the wiki or the Solr
> example solrconfig.xml file.

right i think we should remove these parameters.

>
> Does this mean that the default codec will reduce memory use for the terms
> index enough so I don't need to use either of these settings to deal with
> my > 2 billion term indexes?

probably. i don't know enough about your terms or how much RAM you have
to say for sure.

if not, just customize blocktree's params with a CodecFactory in solr,
or even pick another implementation (FixedGap, VariableGap, whatever).
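
(a sketch of the Solr-side hookup, assuming the SchemaCodecFactory and the
per-fieldType postingsFormat attribute available in recent 4.x builds; the
format name is illustrative:

    <!-- solrconfig.xml -->
    <codecFactory class="solr.SchemaCodecFactory"/>

    <!-- schema.xml -->
    <fieldType name="string_mem" class="solr.StrField" postingsFormat="Memory"/> )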

the interval/divisor stuff is mostly only useful if you are not
reindexing from scratch: e.g. if you are gonna plop your 3.x index
into 4.x then you should set
those to whatever you were using before, since it will be using
PreflexCodec to read those.

-- 
lucidworks.com


Solr 4: Private master, public slave?

2012-09-07 Thread Alexandre Rafalovitch
Hello,

I have a bunch of documents that I would like to index on a local
server behind the firewall. But then, the actual search will happen on
a public infrastructure (Amazon, etc). The documents themselves are
not quite public, so I want just the index content (indexed, not
stored) being available outside the firewall.

Is that something that is doable with Solr Cloud or index copying, etc?

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)
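
If plain index copying is acceptable, one sketch is the stock replication
handler, with the public slave polling the internal master (this assumes the
slave can reach the master through the firewall; note that replication ships
stored fields too, so "indexed, not stored" has to be enforced in the schema
itself):

    <!-- solrconfig.xml on the public slave -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://internal-master:8983/solr/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>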


Re: Version Migration from solr 1.3

2012-09-07 Thread Mani
If you have time, you might as well wait for 4.0 to be released; otherwise,
3.6.1.





Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Thanks Robert,

I'll have to spend some time understanding the default codec for Solr 4.0.
Did I miss something in the changes file?

 I'll be digging into the default codec docs and testing sometime in the next
week or two (with a 2 billion term index).  If I understand it well enough,
I'll be happy to draft some changes up for either the wiki or the Solr
example solrconfig.xml file.

Does this mean that the default codec will reduce memory use for the terms
index enough so I don't need to use either of these settings to deal with
my > 2 billion term indexes?

If both of these parameters don't make sense for the default codec, then
maybe they need to be commented out or removed from the solr example
solrconfig.xml.

Tom

On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir  wrote:

> Hi Tom: I already enhanced the javadocs about this for Lucene, putting
> warnings everywhere in bold:
>
> NOTE: This parameter does not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for term indexes that are implemented as a fixed gap
> between terms.
> NOTE: divisor settings > 1 do not apply to all PostingsFormat
> implementations, including the default one in this release. It only
> makes sense for terms indexes that can efficiently re-sample terms at
> load time.
> etc
>
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
>
> http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29
>
> In the future I expect these parameters will be removed completely:
> anything like this is specific to the codec/implementation.
>
> In Lucene 4.0 the terms index works completely differently: these
> parameters don't make sense for it.
>
> On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West 
> wrote:
> > Hello all,
> >
> > Due to multiple languages and dirty OCR, our indexes have over 2 billion
> > unique terms (
> > http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
> ).
> > In Solr 3.6 and previous we needed to reduce the memory used for storing
> > the in-memory representation of the tii file.   We originally used the
> > termInfosIndexDivisor which affects the sampling of the tii file when
> read
> > into memory.   While this solved our problem for searching, unfortunately
> > the termInfosIndexDivisor was not read during indexing and caused OOM
> > problems once our indexes grew beyond a certain size.  See:
> > https://issues.apache.org/jira/browse/SOLR-2290.
> >
> > Has this been changed in Solr 4.0?
> >
> > The advantage of using the termInfosIndexDivisor is that it can be
> changed
> > without re-indexing, so we were able to experiment with different
> settings
> > to determine a good setting without re-indexing several terabytes of
> data.
> >
> > When we ran into problems with the memory use for the in-memory
> > representation of the tii file during indexing, we changed the
> > termIndexInterval.  The termIndexInterval is an indexing-time setting
> >  which affects the size of the tii file.  It sets the sampling of the tis
> > file that gets written to the tii file.
> >
> > In Solr 4.0 termInfosIndexDivisor has been replaced with
> > termIndexDivisor.  The documentation for these two features, the
> > index-time termIndexInterval and the run-time termIndexDivisor, no longer
> > seems to be on the solr config page of the wiki, and the documentation in
> the
> > example file does not explain what the termIndexDivisor does.
> >
> > Would it be appropriate to add these back to the wiki page?  If not,
> could
> > someone add a line or two to the comments in the Solr 4.0 example file
> > explaining what the termIndexDivisor does?
> >
> >
> > Tom
>
>
>
> --
> lucidworks.com
>


Re: Version Migration from solr 1.3

2012-09-07 Thread Sujatha Arun
I see that 4.0 alpha was released after 3.6.1, so should I look at
3.5 as the most stable release currently?

Version Source :
https://issues.apache.org/jira/browse/SOLR?selectedTab=com.atlassian.jira.plugin.system.project%3Aversions-panel

Regards
Sujatha

On Fri, Sep 7, 2012 at 11:17 PM, Sujatha Arun  wrote:

> Hi ,
>
> If we are migrating from 1.3 ,which is the current stable version that we
> should be looking at  3.6.1  or  3.6.2 ?
>
> Regards
> Sujatha
>


Re: Schema model to store additional field metadata

2012-09-07 Thread Alexandre Rafalovitch
Why would you store the actual images in SOLR? There is no way to
really search the bytes of image, is there? What you probably want to
do is extract all searchable metadata out of that image, name, alt,
EXIF, etc.

And you are most likely looking at dynamic fields as the solution

1) Define *_Path, *_Size, *_Alt as a dynamic field with appropriate types
2) During indexing, write those properties as Image_1_Path,
Image_1_Size, Image_1_Alt or some such
3) Make sure that whatever search algorithm you have looks at those or
do a copyField to aggregate them into AllImage_Alt, etc.

I do something similar by extracting metadata from .DOC files with
Tika and indexing it all regardless of the actual names.

Regards,
   Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Sep 7, 2012 at 1:31 PM,   wrote:
> Hi,
>
> I want to create a Solr index of articles. Each article should have a title, 
> content, published date and an arbitrary number of images attached to it. An 
> article could look like this:
>
> title: An article about Foo and Bar
> content:   This is some text about Foo and Bar.
> published: 2012.09.07T19:23
> image: 2012/09/foo.png
> image: 2012/04/images/bar.png
> image: 2012/02/abc.png
>
> I want to display the images with HTML <img> tags. But besides src I also 
> want to include an alt attribute to describe each image with additional 
> metadata. For example I want to display the article like this:
>
> An article about Foo and Bar
> This is some text about Foo and Bar.
> 
> 
> 
>
> I know that I can use a multiValued field to store the images. But how should 
> or can I store the additional src information? I have a problem finding the 
> right schema for my index.


Re: Indexing CSV files with filenames

2012-09-07 Thread edvicif
My problem is more like the left hand side of the equation.

Is it ${f.name} or something?
On Sep 7, 2012 5:36 PM, "Rafał Kuć-3 [via Lucene]" <
ml-node+s472066n4006179...@n3.nabble.com> wrote:

> Hello!
>
> You can just pass the name of the file to the 'literal' parameter. For
> example adding
>
> literal.filename=my_file.csv
>
> would set the 'filename' field of your document with the value of
> 'my_file.csv'.
>
> --
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch -
> ElasticSearch
>
> > Thx for the quick answer.
>
> > Can you help a little more? I don't really get the concept of literal.
>
> > How can I set a field with the source absolute path?
>
> > I mean how can I find out the parameter names?
>
> > An example would be really helpful.
>
>
>
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Indexing-CSV-files-with-filenames-tp4006165p4006177.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>
>





Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Robert Muir
Hi Tom: I already enhanced the javadocs about this for Lucene, putting
warnings everywhere in bold:

NOTE: This parameter does not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for term indexes that are implemented as a fixed gap
between terms.
NOTE: divisor settings > 1 do not apply to all PostingsFormat
implementations, including the default one in this release. It only
makes sense for terms indexes that can efficiently re-sample terms at
load time.
etc

http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/IndexWriterConfig.html#setTermIndexInterval%28int%29
http://lucene.apache.org/core/4_0_0-BETA/core/org/apache/lucene/index/DirectoryReader.html#open%28org.apache.lucene.store.Directory,%20int%29

In the future I expect these parameters will be removed completely:
anything like this is specific to the codec/implementation.

In Lucene 4.0 the terms index works completely differently: these
parameters don't make sense for it.

On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West  wrote:
> Hello all,
>
> Due to multiple languages and dirty OCR, our indexes have over 2 billion
> unique terms (
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
> In Solr 3.6 and previous we needed to reduce the memory used for storing
> the in-memory representation of the tii file.   We originally used the
> termInfosIndexDivisor which affects the sampling of the tii file when read
> into memory.   While this solved our problem for searching, unfortunately
> the termInfosIndexDivisor was not read during indexing and caused OOM
> problems once our indexes grew beyond a certain size.  See:
> https://issues.apache.org/jira/browse/SOLR-2290.
>
> Has this been changed in Solr 4.0?
>
> The advantage of using the termInfosIndexDivisor is that it can be changed
> without re-indexing, so we were able to experiment with different settings
> to determine a good setting without re-indexing several terabytes of data.
>
> When we ran into problems with the memory use for the in-memory
> representation of the tii file during indexing, we changed the
> termIndexInterval.  The termIndexInterval is an indexing-time setting
>  which affects the size of the tii file.  It sets the sampling of the tis
> file that gets written to the tii file.
>
> In Solr 4.0 termInfosIndexDivisor has been replaced with
> termIndexDivisor.  The documentation for these two features, the
> index-time termIndexInterval and the run-time termIndexDivisor, no longer
> seems to be on the solr config page of the wiki, and the documentation in the
> example file does not explain what the termIndexDivisor does.
>
> Would it be appropriate to add these back to the wiki page?  If not, could
> someone add a line or two to the comments in the Solr 4.0 example file
> explaining what the termIndexDivisor does?
>
>
> Tom



-- 
lucidworks.com


Schema model to store additional field metadata

2012-09-07 Thread sysrq
Hi,

I want to create a Solr index of articles. Each article should have a title, 
content, published date and an arbitrary number of images attached to it. An 
article could look like this:

title: An article about Foo and Bar
content:   This is some text about Foo and Bar.
published: 2012.09.07T19:23
image: 2012/09/foo.png
image: 2012/04/images/bar.png
image: 2012/02/abc.png

I want to display the images with HTML <img> tags. But besides src I also want 
to include an alt attribute to describe each image with additional metadata. 
For example I want to display the article like this:

An article about Foo and Bar
This is some text about Foo and Bar.




I know that I can use a multiValued field to store the images. But how should 
or can I store the additional src information? I have a problem finding the 
right schema for my index.


Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Hello all,

Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file.   We originally used the
termInfosIndexDivisor which affects the sampling of the tii file when read
into memory.   While this solved our problem for searching, unfortunately
the termInfosIndexDivisor was not read during indexing and caused OOM
problems once our indexes grew beyond a certain size.  See:
https://issues.apache.org/jira/browse/SOLR-2290.

Has this been changed in Solr 4.0?

The advantage of using the termInfosIndexDivisor is that it can be changed
without re-indexing, so we were able to experiment with different settings
to determine a good setting without re-indexing several terabytes of data.

When we ran into problems with the memory use for the in-memory
representation of the tii file during indexing, we changed the
termIndexInterval.  The termIndexInterval is an indexing-time setting
 which affects the size of the tii file.  It sets the sampling of the tis
file that gets written to the tii file.

In Solr 4.0 termInfosIndexDivisor has been replaced with
termIndexDivisor.  The documentation for these two features, the
index-time termIndexInterval and the run-time termIndexDivisor, no longer
seems to be on the solr config page of the wiki, and the documentation in the
example file does not explain what the termIndexDivisor does.

Would it be appropriate to add these back to the wiki page?  If not, could
someone add a line or two to the comments in the Solr 4.0 example file
explaining what the termIndexDivisor does?


Tom


Re: Indexing CSV files with filenames

2012-09-07 Thread Rafał Kuć
Hello!

You can just pass the name of the file to the 'literal' parameter. For
example adding

literal.filename=my_file.csv

would set the 'filename' field of your document with the value of
'my_file.csv'.
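
For instance, with curl against the CSV handler (URL, port and file name
illustrative):

    curl "http://localhost:8983/solr/update/csv?commit=true&literal.filename=my_file.csv" \
      -H "Content-Type: text/csv" --data-binary @my_file.csv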

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Thx for the quick answer.

> Can you help a little more? I don't really get the concept of literal.

> How can I set a field with the source absolute path?

> I mean how can I find out the parameter names?

> An example would be really helpful.






Re: Indexing CSV files with filenames

2012-09-07 Thread edvicif
Thx for the quick answer.

Can you help a little more? I don't really get the concept of literal.

How can I set a field with the source absolute path?

I mean how can I find out the parameter names?

An example would be really helpful.





Re: solrcloud setup using tomcat, single machine

2012-09-07 Thread Mark Miller


The above does not look right - you probably would want
/usr/solr/example/solr for your solrhome based on other info you give.

You also reference /usr/solr/data/conf as your conf folder, but I'd
expect it to be something like /usr/solr/example/solr/collection1/conf

 -DhostPort=8080" #might not be useful?/

The above will only work if you go to example/solr/solr.xml and change
jetty.port to hostPort. Otherwise just use jetty.port and it will set
this (regardless of your container).
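
For instance, the start) branch of the init script quoted below could export
(paths and ports taken from that setup, with jetty.port in place of hostPort):

    export JAVA_OPTS="$JAVA_OPTS -DnumShards=1 \
      -Dbootstrap_confdir=/usr/solr/example/solr/collection1/conf \
      -DzkHost=localhost:5200 -Djetty.port=8080"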

Other than that, I can't spot much.

If you don't see the Cloud tab in the Admin UI, Solr does not think
it's in SolrCloud mode for some reason.

On Thu, Sep 6, 2012 at 9:04 PM, JesseBuesking  wrote:
> Hey guys!
>
> I've been attempting to get solrcloud set up on a ubuntu vm, but I believe
> I'm stuck.
>
> I've got tomcat setup, the solr war file in place, and when I browser to
> localhost:port/solr, I can see solr.  CHECK
>
> I've set the zoo.cfg to use port 5200.  I can start it up and see it's
> running (ls / shows me [zookeeper]). CHECK
>
> *Issues I'm running into*
> 1. I'm trying to get it so that the example in solr
> (example/solr/collection1/conf) will load up, however it doesn't look like
> it's working (from posts online, it looks like I should see a *Cloud* tab
> under localhost:port/solr, but it's not appearing.
> 2. Sometimes it looks like things are still trying to run on port 2181
> (default zookeeper port).
> 3. Some commands I run look like they're trying to use jetty still, even
> though I think I have tomcat set up correctly.
>
> I must admit that my background is in C#, so calling java jars passing -D
> everywhere is a bit new to me.
>
> What I'd like to do is start up a solr node using zookeeper through tomcat,
> but it seems like most guide use jetty and I'm having issues trying to
> convert to tomcat.
>
> I don't know what you might need to know to help me out, so I'm going to
> give you as much info on my setup as I can.
>
> For reference, the folder structure I've adopted (feel free to make
> recommendations) is as follows:
> /usr/solr
>   /usr/solr/data/conf # conf files
>   /usr/solr/solr4.0.0-BETA # extraction from the tar.gz
> /usr/tomcat
>   /usr/tomcat/tomcat7.0.30 #where tomcat lives
>   /usr/tomcat/tomcat7.0.30/data/solr.war # war file from the extracted
> tar.gz
>   /usr/tomcat/tomcat7.0.30/conf/Catalina/localhost/solr.xml # contains the
> following
>
>  crossContext="true">
>  value="/usr/solr/data/conf" override="true" />
> 
>
> /usr/zookeeper
>   /usr/zookeeper/zookeeper3.3.6 # zookeeper extraction
>   /usr/zookeeper/zookeeper3.3.6/data # where the data will be stored
>   /usr/zookeeper/zookeeper3.3.6/conf/zoo.cfg # contains the following
>
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/usr/zookeeper/data
> # the port at which the clients will connect
> clientPort=5200
>
> I've created the file /etc/init.d/tomcat (it contains the following):
>
> # Tomcat auto-start
> #
> # description: Auto-starts tomcat
> # processname: tomcat
> # pidfile: /var/run/tomcat.pid
>
> export JAVA_HOME=/opt/java/64/jre1.7.0_07
>
> case $1 in
> start)
>/export JAVA_OPTS="$JAVA_OPTS -DnumShards=1
> -Dbootstrap_confdir=/usr/solr/example/solr/collection1/conf
> -DzkHost=localhost:520
> 0 -DhostPort=8080" #might not be useful?/
>sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
> ;;
> stop)
> sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
> ;;
> restart)
> sh /usr/tomcat/tomcat7.0.30/bin/shutdown.sh
> sh /usr/tomcat/tomcat7.0.30/bin/startup.sh
> ;;
> esac
> exit 0
>
> I've been using some of these posts as references throughout the day (I've
> been at this for several hours):
> http://outerthought.org/blog/491-ot.html
> http://blog.jesjobom.com/2012/08/configurando-solr-cloud-beta-tomcat-zookeeper-externo/
> http://www.slideshare.net/lucenerevolution/how-solrcloud-changes-the-user-experience-in-a-sharded-environment
> http://techspry.com/how-to/how-to-install-tomcat-7-and-solr-on-centos-5-5/
> http://stackoverflow.com/questions/10026014/apache-solr-configuration-with-tomcat-6-0
> ... more, but I don't wanna make this any longer than it needs to be
>
> *End goal for testing*
> On a single box (for testing), get this to happen:
> 1. a single zookeeper instance running on port 5200
> 2. a single tomcat instance running on port 8080
> 3. a single solr node running, using configs stored in zookeeper
>
> *Eventual production goal*
> 1. a 3-piece zookeeper ensemble, running on ports 5200,5201,5202
> 2. one of the following
> a. 4 solr nodes, running replicated (to allow 1 failure)
> b. 4 solr nodes, running replicated (to allow up to 2 failures)
> *. both choices should

Re: Indexing CSV files with filenames

2012-09-07 Thread Rafał Kuć
Hello!

In Solr 4.0 you will have the ability to add an arbitrary field along
with all documents from a single file - 
http://wiki.apache.org/solr/UpdateCSV#literal

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Hi!

> I have a set of CSV files. I want to index them by certain columns. But
> I also want to store the filename they were indexed from.

> The reason is, that the queries I want to run is to identify files.

> David






Indexing CSV files with filenames

2012-09-07 Thread edvicif
Hi!

I have a set of CSV files. I want to index them by certain columns. But
I also want to store the filename they were indexed from.

The reason is, that the queries I want to run is to identify files.

David





Re: Access and copy lucene index data

2012-09-07 Thread Jack Krupansky
You can use the Solr admin "analysis" web page to enter a term or even a 
passage of text and see how it would be analyzed/indexed for any specified 
field or field type.
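
For instance, the same analysis is also exposed over HTTP by the field
analysis handler (host, port and field type illustrative):

    http://localhost:8983/solr/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=Foo+Bar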


-- Jack Krupansky

-Original Message- 
From: Bill_78

Sent: Friday, September 07, 2012 11:23 AM
To: solr-user@lucene.apache.org
Subject: Access and copy lucene index data

Dear all,

Similar subjects about index data have already been posted, but I would like
your advice.

I use solr analysers to process fields, like synonyms, stopwords, ... and I
cannot see the result without using a special tool (like LukeRequestHandler
for example).
I would like to copy the index data into the original field stored by solr.

Can someone tell me if such operation is possible ?
Should I modify the core of solr to do this ? I have the idea to retrieve
the index data et to copy the content into the document field value, but I
don't know exactly where to make the changes to perform this ...

Thanks in advance for your interest.

Bill_78







Re: SOLR 4.0 DataImport frozen or fails with WARNING: Unable to read: dataimport.properties?

2012-09-07 Thread Travis Low
Change your data-config.xml connection XML to this:

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
    url="jdbc:mysql://dbhost:3396/myDB" user="XXX" password="XXX"
    batchSize="-1" />

Then try again.  This keeps the driver from trying to fetch the entire
result set at the same time.

cheers,

Travis


On Fri, Sep 7, 2012 at 4:17 AM, deniz  wrote:

> Hi all,
>
> I have been trying to index my data from mysql db, but somehow  i cant
> index
> anything, and dont see any exception / error in logs, except a warning
> which
> is highlighted below...
>
> Here is my db-config's connection string:
>
>  url="jdbc:mysql://dbhost:3396/myDB" user="XXX" password="XXX" />
>
> (I can connect to the db from command line by using the above settings)
>
> and after i start dataimport i see these in the log:
>
> INFO: Starting Full Import
> Sep 07, 2012 4:08:21 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:21 PM
> org.apache.solr.handler.dataimport.SimplePropertiesWriter
> readIndexerProperties
> *WARNING: Unable to read: dataimport.properties*
> Sep 07, 2012 4:08:21 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
> call
> INFO: Creating a connection for entity user with URL:
> jdbc:mysql://10.60.1.157:3396/poppen
> Sep 07, 2012 4:08:22 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
> call
> INFO: Time taken for getConnection(): 802
> Sep 07, 2012 4:08:23 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:25 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:27 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:29 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:31 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:33 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:36 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:38 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:40 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:42 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:44 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:46 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=1
> Sep 07, 2012 4:08:49 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:51 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:53 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:55 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:08:58 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:00 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:02 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:06 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:08 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:10 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 2012 4:09:12 PM org.apache.solr.core.SolrCore execute
> INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
> status=0 QTime=0
> Sep 07, 20

Re: Website (crawler for) indexing

2012-09-07 Thread Dominique Bejean
Maybe you can take a look at Crawl-Anywhere, which has an administration 
web interface, a Solr indexer and a search web application.


www.crawl-anywhere.com

Regards.

Dominique

On 05/09/12 17:05, Lochschmied, Alexander wrote:

This may be a bit off topic: How do you index an existing website and control 
the data going into index?

We already have Java code to process the HTML (or XHTML) and turn it into a 
SolrJ Document (removing tags and other things we do not want in the index). We 
use SolrJ for indexing.
So I guess the question is essentially which Java crawler could be useful.

We used to use wget on the command line in our publishing process, but we no 
longer want to do that.

Thanks,
Alexander






Access and copy lucene index data

2012-09-07 Thread Bill_78
Dear all,

Similar subjects about index data have already been posted, but I would like
your advice. 

I use solr analysers to process fields, like synonyms, stopwords, ... and I
cannot see the result without using a special tool (like LukeRequestHandler
for example). 
I would like to copy the index data into the original field stored by solr. 

Can someone tell me if such an operation is possible ? 
Should I modify the core of Solr to do this ? I have the idea to retrieve
the index data and to copy the content into the document field value, but I
don't know exactly where to make the changes to perform this ...

Thanks in advance for your interest. 

Bill_78






Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

2012-09-07 Thread Erick Erickson
Thank the guys who actually fixed it!

Thanks for bringing this up, and please let us know if Yonik's patch fixes
your problem

Best
Erick

On Thu, Sep 6, 2012 at 11:39 PM, guenter.hip...@unibas.ch
 wrote:
> Erick, thanks for response!
> Our use case is very straight forward and basic.
> - no cloud infrastructure
- XMLUpdateRequest handler (transformed library bibliographic data which
is pushed by the post.jar component). For deletions I used to use the SolrJ
component until two months ago, but because of the difficulties I read about I
changed back to the basic procedure with XML documents
> - around 18 million documents, no distributed shards
> - once the basic use case is stable and maintainable we are heading forward
> to the more fancy things ;-)
>
> Yonik provided a patch (https://issues.apache.org/jira/browse/SOLR-3793)
> yesterday morning. I'm going to run tests once it is part of the nightly
> builds. As of now, if I'm not wrong
> (https://builds.apache.org/job/Solr-Artifacts-4.x/), the latest build doesn't
> contain it.
>
> Best wishes from Basel, Günter
>
>
> On 09/07/2012 07:09 AM, Erick Erickson wrote:
>>
>> Guenter:
>>
>> Are you using SolrCloud or straight Solr? And were you updating in
>> batches (i.e. updating multiple docs at once from SolrJ by using the
>> server.add(doclist) form)?
>>
>> There was a bug in this process that caused various docs to show up
>> in various shards differently. This has been fixed in 4x, any nightly
>> build should have the fix.
>>
>> I'm absolutely grasping at straws here, but this was a weird case that
>> I happen to know about...
>>
>> Hossman:
>> of course this all goes up in smoke if you can reproduce this with any
>> recent compilation of the code.
>>
>> FWIW
>> Erick
>>
>> On Wed, Sep 5, 2012 at 11:29 PM, guenter.hip...@unibas.ch
>>  wrote:
>>>
>>> Hoss, I'm so happy you realized the problem because I was quite worried
>>> about it!!
>>>
>>> Let me know if I can provide support with testing it.
>>> The last two days I was busy with migrating a bunch of hosts which should
>>> -hopefully- be finished today.
>>> Then I have again the infrastructure for running tests
>>>
>>> Günter
>>>
>>>
>>> On 09/05/2012 11:19 PM, Chris Hostetter wrote:

 : Subject: Re: use of filter queries in Lucene/Solr Alpha40 and Beta4.0

 Günter, This is definitely strange

 The good news is, i can reproduce your problem.
 The bad news is, i can reproduce your problem - and i have no idea
 what's
 causing it.

 I've opened SOLR-3793 to try to get to the bottom of this, and included
 some basic steps to demonstrate the bug using the Solr 4.0-BETA example
 data, but i'm really not sure what the problem might be...

 https://issues.apache.org/jira/browse/SOLR-3793


 -Hoss
>>>
>>>
>>>
>>> --
>>> Universität Basel
>>> Universitätsbibliothek
>>> Günter Hipler
>>> Projekt SwissBib
>>> Schoenbeinstrasse 18-20
>>> 4056 Basel, Schweiz
>>> Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
>>> e-mailguenter.hip...@unibas.ch
>>> URL:www.swissbib.org   /http://www.ub.unibas.ch/
>>>
>
>
> --
> Universität Basel
> Universitätsbibliothek
> Günter Hipler
> Projekt SwissBib
> Schoenbeinstrasse 18-20
> 4056 Basel, Schweiz
> Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
> e-mailguenter.hip...@unibas.ch
> URL:www.swissbib.org   /http://www.ub.unibas.ch/
>


Re: groups.limit=0 in sharding core results in IllegalArgumentException

2012-09-07 Thread yriveiro
Hi, 

I have the same issue using solr 4.0-ALPHA.





Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Dotan Cohen
On Fri, Sep 7, 2012 at 5:04 PM, Yonik Seeley  wrote:
> On Fri, Sep 7, 2012 at 9:39 AM, Erik Hatcher  wrote:
>> A "trie" field probably doesn't work properly, as it indexes multiple terms 
>> per value and you'd get odd values.
>
> I don't know about pivot faceting, but all of the other types of
> faceting take this into account (hence faceting works fine on trie
> fields).
>

Thanks. I am not familiar with the trie field, but I'll look into it.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Dotan Cohen
On Fri, Sep 7, 2012 at 4:39 PM, Erik Hatcher  wrote:
>> Just to be clear, as I'm not logged onto the dev server at the moment
>> but it was implied in an earlier mail: Any field that is to be pivoted
>> on needs to be a string field? Is that documented, as I cannot find
>> that in the docs.
>
> No, it doesn't need to be a string field but whatever terms come out of 
> the analysis process are what gets faceted upon.  If it was a "text" field, 
> each word in the field would be a facet value.  A "trie" field probably 
> doesn't work properly, as it indexes multiple terms per value and you'd get 
> odd values.   Pivot faceting was initially implemented only with textual 
> terms in mind, and string is generally the desired type.
>

Thanks for the insight. I'll see how much time for experimentation I
might afford.


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Yonik Seeley
On Fri, Sep 7, 2012 at 9:39 AM, Erik Hatcher  wrote:
> A "trie" field probably doesn't work properly, as it indexes multiple terms 
> per value and you'd get odd values.

I don't know about pivot faceting, but all of the other types of
faceting take this into account (hence faceting works fine on trie
fields).

-Yonik
http://lucidworks.com


Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Erik Hatcher

On Sep 7, 2012, at 09:29 , Dotan Cohen wrote:

> On Fri, Sep 7, 2012 at 4:05 PM, Erik Hatcher  wrote:
> 
>> Ranges won't work at all; pivots are purely by individual term currently.
>> 
>> If you want to pivot by ranges, and you can define those ranges during 
>> indexing, then you could make a field that represented which range each 
>> document is in.
>> 
>>  doc:
>>id: 1234
>>category: History
>>date_range_buckets: 2004/March->June
>> 
>> or something like that.  Then you could pivot on category and 
>> date_range_buckets.  It's a hacky workaround, but might just be sufficient 
>> for some cases.
>> 
> 
> Thanks. As there are other applications using the index I was hoping
> to avoid adding a redundant work-around field. But it looks like the
> best solution.
> 
> Just to be clear, as I'm not logged onto the dev server at the moment
> but it was implied in an earlier mail: Any field that is to be pivoted
> on needs to be a string field? Is that documented, as I cannot find
> that in the docs.

No, it doesn't need to be a string field but whatever terms come out of the 
analysis process are what gets faceted upon.  If it was a "text" field, each 
word in the field would be a facet value.  A "trie" field probably doesn't work 
properly, as it indexes multiple terms per value and you'd get odd values.   
Pivot faceting was initially implemented only with textual terms in mind, and 
string is generally the desired type.

Erik



Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Dotan Cohen
On Fri, Sep 7, 2012 at 4:05 PM, Erik Hatcher  wrote:

> Ranges won't work at all; pivots are purely by individual term currently.
>
> If you want to pivot by ranges, and you can define those ranges during 
> indexing, then you could make a field that represented which range each 
> document is in.
>
>   doc:
> id: 1234
> category: History
> date_range_buckets: 2004/March->June
>
> or something like that.  Then you could pivot on category and 
> date_range_buckets.  It's a hacky workaround, but might just be sufficient 
> for some cases.
>

Thanks. As there are other applications using the index I was hoping
to avoid adding a redundant work-around field. But it looks like the
best solution.

Just to be clear, as I'm not logged onto the dev server at the moment
but it was implied in an earlier mail: Any field that is to be pivoted
on needs to be a string field? Is that documented, as I cannot find
that in the docs.

Thanks!

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


Re: Solr 4.0alpha: edismax complaints on certain characters

2012-09-07 Thread Alexandre Rafalovitch
Thank you. I can confirm that moving to Beta has made that problem go away.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Sep 6, 2012 at 12:38 PM, Jack Krupansky  wrote:
> The fix in edismax was made just a few days (6/28) before the formal
> announcement of 4.0-ALPHA (7/3), but unfortunately the fix came a few days
> after the cutoff for 4.0-ALPHA (6/25).
>
> See:
> https://issues.apache.org/jira/browse/SOLR-3467
>
> (That issue should probably be annotated to indicate that it "affects"
> 4.0-ALPHA.)
>
> -- Jack Krupansky
>
> -Original Message- From: Alexandre Rafalovitch
> Sent: Thursday, September 06, 2012 10:13 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 4.0alpha: edismax complaints on certain characters
>
> I am on 4.0 alpha. Maybe it was fixed in beta. But I am most
> definitely seeing this in edismax. If I get rid of / and use
> debugQuery, I get:
> 'responseHeader'=>{
>'status'=>0,
>'QTime'=>14,
>'params'=>{
>  'debugQuery'=>'true',
>  'indent'=>'true',
>  'q'=>'foobar',
>  'qf'=>'TitleEN DescEN',
>  'wt'=>'ruby',
>  'defType'=>'edismax'}},
>  'response'=>{'numFound'=>0,'start'=>0,'docs'=>[]
>  },
>  'debug'=>{
>'rawquerystring'=>'foobar',
>'querystring'=>'foobar',
>'parsedquery'=>'(+DisjunctionMaxQuery((DescEN:foobar |
> TitleEN:foobar)))/no_coord',
>'parsedquery_toString'=>'+(DescEN:foobar | TitleEN:foobar)',
>'explain'=>{},
>'QParser'=>'ExtendedDismaxQParser',
> 
>
> I'll check beta on my machine by tomorrow.
>
> Regards,
>   Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Thu, Sep 6, 2012 at 10:06 AM, Jack Krupansky 
> wrote:
>>
>> That's what I was thinking, but when I tried foo/bar in Solr 3.6 and
>> 4.0-BETA it was working fine - it split the term and generated the proper
>> query without any error.
>>
>> I think the problem is if you use the default Lucene query parser, not
>> edismax. I removed &defType=edismax from my query request and the problem
>> reproduces.
>>
>> My two test queries:
>>
>> http://localhost:8983/solr/select/?debugQuery=true&defType=edismax&qf=features&q=foo/bar
>> http://localhost:8983/solr/select/?debugQuery=true&df=features&q=foo/bar
>>
>> The first works; the second fails as reported (in 4.0-BETA, but works in
>> 3.6).
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Yonik Seeley
>> Sent: Thursday, September 06, 2012 9:53 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr 4.0alpha: edismax complaints on certain characters
>>
>>
>> I believe this is caused by the regex support in
>> https://issues.apache.org/jira/browse/LUCENE-2039
>>
>> It certainly seems wrong to interpret a slash in the middle of the
>> word as the start of a regex, so I've reopened the issue.
>>
>> -Yonik
>> http://lucidworks.com
>>
>>
>> On Thu, Sep 6, 2012 at 9:34 AM, Alexandre Rafalovitch
>>  wrote:
>>>
>>>
>>> Hello,
>>>
>>> I was under the impression that edismax was supposed to be crash proof
>>> and just ignore bad syntax. But I am either misconfiguring it or hit a
>>> weird bug. I basically searched for text containing '/' and got this:
>>>
>>> {
>>>   'responseHeader'=>{
>>> 'status'=>400,
>>> 'QTime'=>9,
>>> 'params'=>{
>>>   'qf'=>'TitleEN DescEN',
>>>   'indent'=>'true',
>>>   'wt'=>'ruby',
>>>   'q'=>'foo/bar',
>>>   'defType'=>'edismax'}},
>>>   'error'=>{
>>> 'msg'=>'org.apache.lucene.queryparser.classic.ParseException:
>>> Cannot parse \'foo/bar \': Lexical error at line 1, column 9.
>>> Encountered: <EOF> after : "/bar "',
>>> 'code'=>400}}
>>>
>>> Is that normal? If it is, is there a known list of characters I need
>>> to escape or do I just have to catch the exception and tell user to
>>> not do this again?
>>>
>>> Regards,
>>>Alex.
>>>
>>> Personal blog: http://blog.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all
>>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>>> book)
>>
>>
>>
>
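
On the escaping question: SolrJ ships a helper that escapes all of the query
parser's special characters, so client code doesn't have to maintain its own
list. A minimal sketch (the query string is just an example; in recent 4.x
builds the escape set includes the forward slash, but check your version):

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeDemo {
        public static void main(String[] args) {
            String raw = "foo/bar";
            // Backslash-escapes Lucene/Solr query syntax characters
            // such as + - ! ( ) : ^ [ ] " { } ~ * ? | & ; and whitespace.
            String safe = ClientUtils.escapeQueryChars(raw);
            System.out.println(safe);
        }
    }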


Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Erik Hatcher

On Sep 7, 2012, at 08:36 , Dotan Cohen wrote:

> On Fri, Sep 7, 2012 at 12:23 PM, Erik Hatcher  wrote:
>> Pivot facets currently only work with individual terms, not ranges.
>> 
>> The response you provided does look odd in that there are duplicate 
>> timestamps listed, but pivots were only implemented for textual (string 
>> being the most common type) fields initially.
>> 
> 
> I see, thanks. Other than creating an additional rounded-off timestamp
> field, are there any other solutions? Might ranges work if instead of
> a timestamp we used a real DateTime field?
> 
> In any case, in order to pivot on the timestamp, will I have to change
> its type to string?

Ranges won't work at all; pivots are purely by individual term currently.

If you want to pivot by ranges, and you can define those ranges during 
indexing, then you could make a field that represented which range each 
document is in.

  doc:  
id: 1234
category: History
date_range_buckets: 2004/March->June

or something like that.  Then you could pivot on category and 
date_range_buckets.  It's a hacky workaround, but might just be sufficient for 
some cases.
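
For illustration, computing such a bucket at index time could look like the
SolrJ sketch below (field names and the bucketing rule are made up for this
example, using 100-second windows like the range.gap earlier in the thread):

    import org.apache.solr.common.SolrInputDocument;

    public class Buckets {
        // Map a raw epoch-seconds timestamp onto a 100-second window label.
        static String bucketFor(long epochSeconds) {
            long start = (epochSeconds / 100) * 100;
            return start + "-" + (start + 100);
        }

        public static void main(String[] args) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1234");
            doc.addField("category", "History");
            doc.addField("timestamp", 1346968512L);
            // extra string field that pivot faceting can use directly
            doc.addField("date_range_buckets", bucketFor(1346968512L));
            System.out.println(doc);
        }
    }

The query would then pivot on the precomputed field, e.g.
facet.pivot=category,date_range_buckets.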

Erik



Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Dotan Cohen
On Fri, Sep 7, 2012 at 12:23 PM, Erik Hatcher  wrote:
> Pivot facets currently only work with individual terms, not ranges.
>
> The response you provided does look odd in that there are duplicate 
> timestamps listed, but pivots were only implemented for textual (string being 
> the most common type) fields initially.
>

I see, thanks. Other than creating an additional rounded-off timestamp
field, are there any other solutions? Might ranges work if instead of
a timestamp we used a real DateTime field?

In any case, in order to pivot on the timestamp, will I have to change
its type to string?


-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com


RE: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the DIH?

2012-09-07 Thread Markus Jelsma
It works indeed:
https://issues.apache.org/jira/browse/SOLR-3808
 
 
-Original message-
> From:Markus Jelsma 
> Sent: Fri 07-Sep-2012 10:40
> To: solr-user@lucene.apache.org
> Subject: RE: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the 
> DIH?
> 
> Hi,
> 
> It should not be so hard but it looks like the current SolrContentHandler 
> builds up the document via SAX-events. You could pass a 
> BoilerpipeContentHandler((ContentHandler)parsingHandler, BoilerpipeExtractor) 
> to the parser in ExtractingDocumentLoader.java. It should work.
> 
> Markus
> 
>  
>  
> -Original message-
> > From:Lance Norskog 
> > Sent: Thu 06-Sep-2012 05:51
> > To: solr-user@lucene.apache.org
> > Subject: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the 
> > DIH?
> > 
> > Tika integrated Boilerpipe a few releases back. Is it possible to invoke it 
> > when using the ExtractingUpdateHandler (simple Tika) or the 
> > DataImportHandler? 
> > 
> > http://code.google.com/p/boilerpipe/ 
> > 
> > 
> > 
> 


Re: SOLR 4.0 / Jetty Security Set Up

2012-09-07 Thread dan sutton
Hi,

If, like most people, you have application server(s) in front of Solr,
the simplest and most secure option is to bind Solr to a private address
(192.168.* or 10.0.0.*). The app server talks to Solr via that private,
non-routable IP address, which no one from outside can ever access
since it is not routable on the public internet.

Plus you then don't need to employ authentication, which can slow down
responses, since you're ONLY employing access control. This is what we do
for access to 5 Solr servers.
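
With the stock Solr 4 example (Jetty), the bind address can usually be set via
the jetty.host system property -- a sketch, assuming the example jetty.xml
wires that property to the connector (check yours):

    java -Djetty.host=127.0.0.1 -jar start.jar

Use a private interface address instead of 127.0.0.1 if the app servers sit
on other machines.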

Cheers,
Dan

On Wed, Sep 5, 2012 at 10:51 AM, Paul Codman  wrote:
> First time Solr user and I am loving it! I have a standard Solr 4 setup
> running under Jetty. The instructions in the wiki do not seem to apply to
> Solr 4 (e.g. Mortbay references, and the section to uncomment is not present
> in the XML file). Could someone please advise on the steps required to secure
> Solr 4, and can someone confirm that security operates in relation to the new
> Admin interface? Thanks in advance.


Marco Scalone is out of the office.

2012-09-07 Thread Marco Scalone

I will be out of the office from Fri 07/09/2012 and will not return until
Thu 20/09/2012.

I will respond to your message when I return.



Re: Unexpected results in Solr 4 Pivot Faceting

2012-09-07 Thread Erik Hatcher
Pivot facets currently only work with individual terms, not ranges.

The response you provided does look odd in that there are duplicate timestamps 
listed, but pivots were only implemented for textual (string being the most 
common type) fields initially.

Erik

On Sep 6, 2012, at 19:04 , Dotan Cohen wrote:

> In Solr 4, using this faceting pivot:
> 
> &facet=true
> &facet.field=provider
> &facet.range=timestamp
> &f.timestamp.facet.range.start=1346968500
> &f.timestamp.facet.range.end=1346969000
> &f.timestamp.facet.range.gap=100
> &facet.pivot=timestamp,provider
> 
> I am getting facet results with the timestamp -1325191116. In fact,
> most of them have that date. What am I doing wrong? I am trying to get
> a table of facets for each provider for each time period:
> 1346968500 - 1346968600
> 1346968600 - 1346968700
> 1346968700 - 1346968800
> 1346968800 - 1346968900
> 1346968900 - 1346969000
> 
> Assuming 12 providers and 5 time periods, I should be getting 60
> results. But I'm not, I am getting repeated results like this:
> 
> <lst>
>   <str name="field">timestamp</str>
>   <int name="value">-1325191116</int>
>   <int name="count">17</int>
> </lst>
> <lst>
>   <str name="field">timestamp</str>
>   <int name="value">-1325191116</int>
>   <int name="count">17</int>
> </lst>
> <lst>
>   <str name="field">timestamp</str>
>   <int name="value">-1325191116</int>
>   <int name="count">15</int>
> </lst>
> <lst>
>   <str name="field">timestamp</str>
>   <int name="value">-1325191116</int>
>   <int name="count">14</int>
> </lst>
> <lst>
>   <str name="field">timestamp</str>
>   <int name="value">-1325191116</int>
>   <int name="count">14</int>
> </lst>
> 
> Note that I am connecting to a Solr 4 instance. Thanks for providing
> any insight.
> 
> 
> 
> 
> 
> -- 
> Dotan Cohen
> 
> http://gibberish.co.il
> http://what-is-what.com



groups.limit=0 in sharding core results in IllegalArgumentException

2012-09-07 Thread mechravi25
Hi,

I'm using Solr version 3.6.1. I have kept corex as the common core, i.e. I've
used the sharding concept on this core to get the indexed data from all the
other cores.

Here, if I use grouping with groups.limit=0, it results in the following
exception:


numHits must be > 0; please use TotalHitCountCollector if you just need the
total hit count

java.lang.IllegalArgumentException: numHits must be > 0; please use
TotalHitCountCollector if you just need the total hit count
at org.apache.lucene.search.TopFieldCollector.create
at
org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector.
at org.apache.lucene.search.grouping.TermSecondPassGroupingCollector. 
at
org.apache.solr.search.grouping.distributed.command.TopGroupsFieldCommand.create
at org.apache.solr.search.grouping.CommandHandler.execute   
at org.apache.solr.handler.component.QueryComponent.process
at org.apache.solr.handler.component.SearchHandler.handleRequestBody
at org.apache.solr.handler.RequestHandlerBase.handleRequest
at org.apache.solr.core.SolrCore.execute
at org.apache.solr.servlet.SolrDispatchFilter.execute
at org.apache.solr.servlet.SolrDispatchFilter.doFilter  
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter
at org.mortbay.jetty.servlet.ServletHandler.handle  
at org.mortbay.jetty.security.SecurityHandler.handle
at org.mortbay.jetty.servlet.SessionHandler.handle
at org.mortbay.jetty.handler.ContextHandler.handle
at org.mortbay.jetty.webapp.WebAppContext.handle
at org.mortbay.jetty.handler.ContextHandlerCollection.handle
at org.mortbay.jetty.handler.HandlerCollection.handle
at org.mortbay.jetty.handler.HandlerWrapper.handle
at org.mortbay.jetty.Server.handle

But the same query works fine if I leave out groups.limit or give any
non-zero value for it.

I also applied the patch file found at the URL below, and it still has
the same problem:
https://issues.apache.org/jira/browse/SOLR-2923

Can you tell me if this can be rectified in any way, or if I am missing
anything?
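
(For what it's worth, the collector the error message points at is plain
Lucene API; if only the count is needed it is used roughly as below -- a
sketch, not tied to Solr's grouping code:)

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TotalHitCountCollector;

    class HitCount {
        // Counts matches without collecting any top-N hits, which is
        // why it needs no numHits argument to validate.
        static int countHits(IndexSearcher searcher, Query query) throws IOException {
            TotalHitCountCollector collector = new TotalHitCountCollector();
            searcher.search(query, collector);
            return collector.getTotalHits();
        }
    }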

Thank You





Re: SOLR 4.0 / Jetty Security Set Up

2012-09-07 Thread Tomas Zerolo
On Fri, Sep 07, 2012 at 08:50:58AM +0200, Paul Libbrecht wrote:
> Erick,
> 
> I think that should be described differently...
> You need to set-up protected access for some paths.
> /update is one of them.
> And you could make this protected at the jetty level or using Apache proxies 
> and rewrites.

So you'd advise always putting an Apache in front of Jetty?

> Probably /select should be kept open

As far as I understand [1], it's better to close /select (because you can
easily make an admin or update out of it, by e.g. doing a /select?qt=/admin
or /select?qt=/update)
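
(One way to close that hole, assuming a 3.6+/4.0-style solrconfig.xml -- a
sketch, check your own config:)

    <!-- With handleSelect="false", the qt parameter is no longer used to
         pick the handler, so /select cannot be re-routed to /update or
         admin handlers. -->
    <requestDispatcher handleSelect="false">
      <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    </requestDispatcher>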

>  but you need to evaluate if that can get 
> you
> in DoS attacks if there are too big selects. If that is the case, you're left 
> to
> programme an interface all by yourself which limits and fetches from solr, or 
> which
> lives inside solr (a query component) and throws if things are too big.

[1] 

Regads
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zer...@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele


RE: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the DIH?

2012-09-07 Thread Markus Jelsma
Hi,

It should not be so hard but it looks like the current SolrContentHandler 
builds up the document via SAX-events. You could pass a 
BoilerpipeContentHandler((ContentHandler)parsingHandler, BoilerpipeExtractor) 
to the parser in ExtractingDocumentLoader.java. It should work.
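
A minimal sketch of that wiring outside of Solr (assuming Tika with the
boilerpipe module on the classpath; the concrete handler and extractor
choices here are just examples):

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.BoilerpipeContentHandler;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    class BoilerpipeParse {
        static String mainContent(InputStream in) throws Exception {
            BodyContentHandler text = new BodyContentHandler();
            // Wrap the handler Solr would normally hand to Tika so that only
            // boilerpipe-filtered text events reach it.
            ContentHandler filtered =
                new BoilerpipeContentHandler(text, ArticleExtractor.INSTANCE);
            new AutoDetectParser().parse(in, filtered, new Metadata(), new ParseContext());
            return text.toString();
        }
    }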

Markus

 
 
-Original message-
> From:Lance Norskog 
> Sent: Thu 06-Sep-2012 05:51
> To: solr-user@lucene.apache.org
> Subject: Is Boilerpipe usable through Solr ExtractingUpdateHandler or the DIH?
> 
> Tika integrated Boilerpipe a few releases back. Is it possible to invoke it 
> when using the ExtractingUpdateHandler (simple Tika) or the 
> DataImportHandler? 
> 
> http://code.google.com/p/boilerpipe/ 
> 
> 
> 


SOLR 4.0 DataImport frozen or fails with WARNING: Unable to read: dataimport.properties?

2012-09-07 Thread deniz
Hi all,

I have been trying to index my data from a MySQL DB, but somehow I can't index
anything, and I don't see any exception/error in the logs, except a warning
which is highlighted below...

Here is my db-config's connection string:



(I can connect to the db from command line by using the above settings)

and after i start dataimport i see these in the log:

INFO: Starting Full Import
Sep 07, 2012 4:08:21 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
status=0 QTime=0 
Sep 07, 2012 4:08:21 PM
org.apache.solr.handler.dataimport.SimplePropertiesWriter
readIndexerProperties
*WARNING: Unable to read: dataimport.properties*
Sep 07, 2012 4:08:21 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
call
INFO: Creating a connection for entity user with URL:
jdbc:mysql://10.60.1.157:3396/poppen
Sep 07, 2012 4:08:22 PM org.apache.solr.handler.dataimport.JdbcDataSource$1
call
INFO: Time taken for getConnection(): 802
Sep 07, 2012 4:08:23 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
status=0 QTime=1 
Sep 07, 2012 4:08:25 PM org.apache.solr.core.SolrCore execute
INFO: [collection1] webapp=/solr path=/dataimport params={command=status}
status=0 QTime=1 
[... the same command=status entries repeat every two seconds for the rest of the log ...]