Re: Data Import Handler (DIH) - Installing and running

2020-12-23 Thread Erick Erickson
Have you done what the message says and looked at your Solr log? If so,
what information is there?

> On Dec 23, 2020, at 5:13 AM, DINSD | SPAutores wrote:
> 
> Hi,
> 
> I'm trying to install the package "data-import-handler", since it was 
> discontinued from the core Solr distribution.
> 
> https://github.com/rohitbemax/dataimporthandler
> 
> However, as soon as the first command is carried out
> 
> solr -c -Denable.packages=true
> 
> I get this screen in the web interface (screenshot omitted):
> 
> Has anyone been through this, or does anyone have any idea why it's happening?
> 
> Thanks for any help
> Rui Pimentel
> 
> 
> 
> DINSD - Departamento de Informática / SPA Digital
> Av. Duque de Loulé, 31 - 1069-153 Lisboa  PORTUGAL
> T (+ 351) 21 359 44 36 / (+ 351) 21 359 44 00  F (+ 351) 21 353 02 57
>  informat...@spautores.pt
>  www.SPAutores.pt
> 



Re: data import handler deprecated?

2020-11-30 Thread Dmitri Maziuk

On 11/30/2020 7:50 AM, David Smiley wrote:

Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.


Just FYI, there is the dih 8.7.0 jar in 
repo1.maven.org/maven2/org/apache/solr -- whereas the github build is on 
8.6.0.


Dima



Re: data import handler deprecated?

2020-11-30 Thread David Smiley
Yes, absolutely to what Eric said.  We goofed on news / release highlights
on how to communicate what's happening in Solr.  From a Solr insider point
of view, we are "deprecating" because strictly speaking, the code isn't in
our codebase any longer.  From a user point of view (the audience of news /
release notes), the functionality has *moved*.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Nov 30, 2020 at 8:04 AM Eric Pugh 
wrote:

> You don’t need to abandon DIH right now….   You can just use the Github
> hosted version….   The more people who use it, the better a community it
> will form around it!It’s a bit chicken and egg, since no one is
> actively discussing it, submitting PR’s etc, it may languish.   If you use
> it, and test it, and support other community folks using it, then it will
> continue on!
>
>
>
> > On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk 
> wrote:
> >
> > On 11/29/2020 10:32 AM, Erick Erickson wrote:
> >
> >> And I absolutely agree with Walter that the DB is often where
> >> the bottleneck lies. You might be able to
> >> use multiple threads and/or processes to query the
> >> DB if that’s the case and you can find some kind of partition
> >> key.
> >
> > IME the difficult part has always been dealing with incremental updates,
> if we were to roll our own, my vote would be for a database trigger that
> does a POST in whichever language the DBMS likes.
> >
> > But this has not been a part of our "solr 6.5 update" project until now.
> >
> > Thanks everyone,
> > Dima
>
> ___
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 |
> http://www.opensourceconnections.com <
> http://www.opensourceconnections.com/> | My Free/Busy <
> http://tinyurl.com/eric-cal>
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <
> https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
>
>
>


Re: data import handler deprecated?

2020-11-30 Thread Eric Pugh
You don’t need to abandon DIH right now… you can just use the GitHub-hosted 
version… The more people who use it, the better the community that will form 
around it! It’s a bit chicken-and-egg: since no one is actively discussing it, 
submitting PRs, etc., it may languish. If you use it, and test it, and 
support other community folks using it, then it will continue on!



> On Nov 29, 2020, at 12:12 PM, Dmitri Maziuk  wrote:
> 
> On 11/29/2020 10:32 AM, Erick Erickson wrote:
> 
>> And I absolutely agree with Walter that the DB is often where
>> the bottleneck lies. You might be able to
>> use multiple threads and/or processes to query the
>> DB if that’s the case and you can find some kind of partition
>> key.
> 
> IME the difficult part has always been dealing with incremental updates, if 
> we were to roll our own, my vote would be for a database trigger that does a 
> POST in whichever language the DBMS likes.
> 
> But this has not been a part of our "solr 6.5 update" project until now.
> 
> Thanks everyone,
> Dima

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com  | 
My Free/Busy   
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 





Re: data import handler deprecated?

2020-11-29 Thread Dmitri Maziuk

On 11/29/2020 10:32 AM, Erick Erickson wrote:


And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.


IME the difficult part has always been dealing with incremental updates. If 
we were to roll our own, my vote would be for a database trigger that does a 
POST in whichever language the DBMS likes.


But this has not been a part of our "solr 6.5 update" project until now.

Thanks everyone,
Dima


Re: data import handler deprecated?

2020-11-29 Thread Erick Erickson
If you like Java instead of Python, here’s a skeletal program:

https://lucidworks.com/post/indexing-with-solrj/

It’s simple and single-threaded, but could serve as a basis for
something along the lines that Walter suggests.
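For reference, a minimal single-threaded SolrJ loop in that spirit might look
like the sketch below (the collection URL, field names, and batch size are
illustrative placeholders, not taken from the post above):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SkeletalIndexer {
    public static void main(String[] args) throws Exception {
        // Example URL and collection name -- adjust to your own setup.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 10_000; i++) {      // stand-in for a JDBC ResultSet loop
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("name_s", "document " + i);
                batch.add(doc);
                if (batch.size() == 500) {          // send batches, not one doc at a time
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();                          // or rely on autoCommit/commitWithin
        }
    }
}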

And I absolutely agree with Walter that the DB is often where
the bottleneck lies. You might be able to
use multiple threads and/or processes to query the
DB if that’s the case and you can find some kind of partition
key.

You also might (and it depends on the Solr version) be able,
to wrap a jdbc stream in an update decorator.

https://lucene.apache.org/solr/guide/8_0/stream-source-reference.html

https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html
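As a rough illustration of that decorator approach (the connection string,
table, and column names here are invented, and the exact syntax should be
checked against the stream reference for your Solr version), an expression
sent to the /stream handler could look something like:

update(mycollection, batchSize=500,
       jdbc(connection="jdbc:mysql://localhost/mydb?user=solr&password=secret",
            sql="SELECT id, name, price FROM products",
            sort="id asc",
            driver="com.mysql.jdbc.Driver"))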

Best,
Erick

> On Nov 29, 2020, at 3:04 AM, Walter Underwood  wrote:
> 
> I recommend building an outboard loader, like I did a dozen years ago for
> Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
> program, though it reads from a JSONL file, not a database.
> 
> Run a loop fetching records from a database. Put each record into a 
> synchronized
> (thread-safe) queue. Run multiple worker threads, each pulling records from 
> the
> queue, batching them up, and sending them to Solr. For maximum indexing speed
> (at the expense of query performance), count the number of CPUs per shard 
> leader
> and run two worker threads per CPU.
> 
> Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
> documents, depending on the content.
> 
> With this setup, your database will probably be your bottleneck. I’ve had this
> index a million (small) documents per minute to a multi-shard cluster, from a 
> JSONL
> file on local disk.
> 
> Also, don’t worry about finding the leaders and sending the right document to
> the right shard. I just throw the batches at the load balancer and let Solr 
> figure
> it out. That is super simple and amazingly fast.
> 
> If you are doing big batches, building a dumb ETL system with JSONL files in 
> Amazon S3 has some real advantages. It allows loading prod data into a test
> cluster for load benchmarks, for example. Also good for disaster recovery, 
> just
> load the recent batches from S3. Want to know exactly which documents were
> in the index in October? Look at the batches in S3.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
>> 
>> I went through the same stages of grief that you are about to start
>> but (luckily?) my core dataset grew some weird cousins and we ended up
>> writing our own indexer to join them all together/do partial
>> updates/other stuff beyond DIH.  It's not difficult to upload docs but
>> is definitely slower so far.  I think there is a bit of a 'clean core'
>> focus going on in solr-land right now and DIH is easy(!) but it's also
>> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
>> etc) so anyway try to be happy that you are aware of it now.
>> 
>> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  
>> wrote:
>>> 
>>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>>> 
 ...  The bottom of
 that github page isn't hopeful however :)
>>> 
>>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>>> JAR" :)
>>> 
>>> It's a more general question though: what is the path forward for users
>>> with data in two places? Hope that a community-maintained plugin
>>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>>> roll our own delta-updates logic? Or are we to choose one datastore and
>>> drop the other?
>>> 
>>> Dima
> 



Re: data import handler deprecated?

2020-11-29 Thread Walter Underwood
I recommend building an outboard loader, like I did a dozen years ago for
Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
program, though it reads from a JSONL file, not a database.

Run a loop fetching records from a database. Put each record into a synchronized
(thread-safe) queue. Run multiple worker threads, each pulling records from the
queue, batching them up, and sending them to Solr. For maximum indexing speed
(at the expense of query performance), count the number of CPUs per shard leader
and run two worker threads per CPU.

Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000 
documents, depending on the content.

With this setup, your database will probably be your bottleneck. I’ve had this
index a million (small) documents per minute to a multi-shard cluster, from a 
JSONL
file on local disk.

Also, don’t worry about finding the leaders and sending the right document to
the right shard. I just throw the batches at the load balancer and let Solr 
figure
it out. That is super simple and amazingly fast.

If you are doing big batches, building a dumb ETL system with JSONL files in 
Amazon S3 has some real advantages. It allows loading prod data into a test
cluster for load benchmarks, for example. Also good for disaster recovery, just
load the recent batches from S3. Want to know exactly which documents were
in the index in October? Look at the batches in S3.
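Walter's program is Python, but the queue-and-workers shape described above is
easy to sketch in any language. A rough Java outline (URL, field names, queue
and batch sizes are all placeholders) might be:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class OutboardLoader {
    // Sentinel document that tells a worker thread to stop.
    private static final SolrInputDocument POISON = new SolrInputDocument();

    public static void main(String[] args) throws Exception {
        BlockingQueue<SolrInputDocument> queue = new ArrayBlockingQueue<>(10_000);
        // Walter sizes this by the shard leader's CPU count; availableProcessors()
        // on the client machine is only a stand-in here.
        int workers = 2 * Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Workers: pull records off the queue, batch them up, and send them to
        // Solr through the load balancer.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try (SolrClient solr = new HttpSolrClient.Builder(
                        "http://solr-lb.example.com:8983/solr/mycollection").build()) {
                    List<SolrInputDocument> batch = new ArrayList<>();
                    while (true) {
                        SolrInputDocument doc = queue.take();
                        if (doc == POISON) break;
                        batch.add(doc);
                        if (batch.size() == 200) { solr.add(batch); batch.clear(); }
                    }
                    if (!batch.isEmpty()) solr.add(batch);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        // Producer: a real loader would read rows from the database or a JSONL file here.
        for (int i = 0; i < 1_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            queue.put(doc);
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON);   // one stop marker per worker
        }
        pool.shutdown();
    }
}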

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 28, 2020, at 6:23 PM, matthew sporleder  wrote:
> 
> I went through the same stages of grief that you are about to start
> but (luckily?) my core dataset grew some weird cousins and we ended up
> writing our own indexer to join them all together/do partial
> updates/other stuff beyond DIH.  It's not difficult to upload docs but
> is definitely slower so far.  I think there is a bit of a 'clean core'
> focus going on in solr-land right now and DIH is easy(!) but it's also
> easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
> etc) so anyway try to be happy that you are aware of it now.
> 
> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>> 
>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>> 
>>> ...  The bottom of
>>> that github page isn't hopeful however :)
>> 
>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>> JAR" :)
>> 
>> It's a more general question though: what is the path forward for users
>> with data in two places? Hope that a community-maintained plugin
>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>> roll our own delta-updates logic? Or are we to choose one datastore and
>> drop the other?
>> 
>> Dima



Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
I went through the same stages of grief that you are about to start
but (luckily?) my core dataset grew some weird cousins and we ended up
writing our own indexer to join them all together/do partial
updates/other stuff beyond DIH.  It's not difficult to upload docs but
is definitely slower so far.  I think there is a bit of a 'clean core'
focus going on in solr-land right now and DIH is easy(!) but it's also
easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
etc) so anyway try to be happy that you are aware of it now.

On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>
> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>
> > ...  The bottom of
> > that github page isn't hopeful however :)
>
> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
> JAR" :)
>
> It's a more general question though: what is the path forward for users
> with data in two places? Hope that a community-maintained plugin
> will still be there tomorrow? Dump our tables to CSV (and POST them) and
> roll our own delta-updates logic? Or are we to choose one datastore and
> drop the other?
>
> Dima


Re: data import handler deprecated?

2020-11-28 Thread Dmitri Maziuk

On 11/28/2020 5:48 PM, matthew sporleder wrote:


...  The bottom of
that github page isn't hopeful however :)


Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC 
JAR" :)


It's a more general question though: what is the path forward for users 
with data in two places? Hope that a community-maintained plugin 
will still be there tomorrow? Dump our tables to CSV (and POST them) and 
roll our own delta-updates logic? Or are we to choose one datastore and 
drop the other?


Dima


Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
https://solr.cool/#utilities -> https://github.com/rohitbemax/dataimporthandler

You can import it in the many new/novel ways to add things to a solr
install and it should work like always (apparently).  The bottom of
that github page isn't hopeful however :)

On Sat, Nov 28, 2020 at 5:21 PM Dmitri Maziuk  wrote:
>
> Hi all,
>
> trying to set up solr-8.7.0, contrib/dataimporthandler/README.txt says
> this module is deprecated as of 8.6 and scheduled for removal in 9.0.
>
> How do we pull data out of our relational database in 8.7+?
>
> TIA
> Dima



Re: Data Import Handler - Concurrent Entity Importing

2020-05-05 Thread Mikhail Khludnev
Hello, James.

DataImportHandler has a lock preventing concurrent execution. If you need
to run several imports in parallel at the same core, you need to duplicate
"/dataimport" handlers definition in solrconfig.xml. Thus, you can run them
in parallel. Regarding schema, I prefer the latter but mileage may vary.
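For illustration, two such handler definitions in solrconfig.xml could look
like the snippet below (handler names and config file names are made up):

<!-- Two independent DIH endpoints, so two imports can run concurrently on the same core -->
<requestHandler name="/dataimport-products" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">products-data-config.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-orders" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">orders-data-config.xml</str>
  </lst>
</requestHandler>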

--
Mikhail.

On Tue, May 5, 2020 at 6:39 PM James Greene 
wrote:

> Hello, I'm new to the group here so please excuse me if I do not have the
> etiquette down yet.
>
> Is it possible to have multiple entities (customer configurable, up to 40
> atm) in a DIH configuration to be imported at once?  Right now I have
> multiple root entities in my configuration but they get indexed
> sequentially and this means the entities that are last are always delayed
> hitting the index.
>
> I'm trying to migrate an existing setup (solr 6.6) that utilizes a
> different collection for each "entity type" into a single collection (solr
> 8.4) to get around some of the hurdles faced when needing to have searches
> that require multiple block joins and currently does not work going cross
> core.
>
> I'm also wondering if it is better to fully qualify a field name or use two
> different fields for performing the "same" search.  i.e:
>
>
> {
> type_A_status: Active
> type_A_value: Test
> }
> vs
> {
> type: A
> status: Active
> value: Test
> }
>


-- 
Sincerely yours
Mikhail Khludnev


Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Alexandre Rafalovitch
Ok, so then you can switch to debug mode and keep trying to figure it
out. Also try BinFileDataSource or URLDataSource, maybe it will have
an easier way.

Or using relative path (example:
https://github.com/arafalov/solr-apachecon2018-presentation/blob/master/configsets/pets-final/pets-data-config.xml).

Regards,
   Alex.
On Tue, 2 Oct 2018 at 12:46, Martin Frank Hansen (MHQ)  wrote:
>
> Thanks for the info, the UI looks interesting... It does read the data-config 
> correctly, so the problem is probably in this file.
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -----Original Message-----
> From: Alexandre Rafalovitch 
> Sent: 2 October 2018 18:18
> To: solr-user 
> Subject: Re: data-import-handler for solr-7.5.0
>
> Admin UI for DIH will show you the config file read. So, if nothing is there, 
> the path is most likely the issue
>
> You can also provide or update the configuration right in UI if you enable 
> debug.
>
> Finally, the config file is reread on every invocation, so you don't need to 
> restart the core after changing it.
>
> Hope this helps,
>Alex.
> On Tue, 2 Oct 2018 at 11:45, Jan Høydahl  wrote:
> >
> > > url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> >
> > Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > On 2 Oct 2018, at 17:15, Martin Frank Hansen (MHQ) wrote:
> > >
> > > Hi,
> > >
> > > I am having some problems getting the data-import-handler in Solr to 
> > > work. I have tried a lot of things but I simply get no response from 
> > > Solr, not even an error.
> > >
> > > When calling the API:
> > > http://localhost:8983/solr/nh/dataimport?command=full-import
> > > {
> > >  "responseHeader":{
> > >"status":0,
> > >"QTime":38},
> > >  "initArgs":[
> > >"defaults",[
> > >  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
> > >  "command":"full-import",
> > >  "status":"idle",
> > >  "importResponse":"",
> > >  "statusMessages":{}}
> > >
> > > The data looks like this:
> > >
> > > 
> > >  
> > > 2165432
> > > 5  
> > >
> > >  
> > > 28548113
> > > 89   
> > >
> > >
> > > The data-config file looks like this:
> > >
> > > 
> > >  
> > >
> > >   > >name="xml"
> > >pk="id"
> > >processor="XPathEntityProcessor"
> > >stream="true"
> > >forEach="/journal/doc"
> > >url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> > >transformer="RegexTransformer,TemplateTransformer"
> > >>
> > >
> > >
> > >
> > >  
> > >  
> > > 
> > >
> > > And I referenced the jar files in the solr-config.xml as well as adding 
> > > the request-handler by adding the following lines:
> > >
> > > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> > > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
> > >
> > >
> > > <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> > >   <lst name="defaults">
> > >     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > I am running a core residing in the folder 
> > > “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> > > “C:/Users/z6mhq/Documents/solr-7.5.0”.
> > >
> > > I really hope that someone can spot my mistake…
> > >
> > > Thanks in advance.
> > >
> > > Martin Frank Hansen
> > >
> > >

Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Alexandre Rafalovitch
Admin UI for DIH will show you the config file as read. So, if nothing is
there, the path is most likely the issue.

You can also provide or update the configuration right in UI if you
enable debug.

Finally, the config file is reread on every invocation, so you don't
need to restart the core after changing it.

Hope this helps,
   Alex.
On Tue, 2 Oct 2018 at 11:45, Jan Høydahl  wrote:
>
> > url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
>
> Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 2 Oct 2018, at 17:15, Martin Frank Hansen (MHQ) wrote:
> >
> > Hi,
> >
> > I am having some problems getting the data-import-handler in Solr to work. 
> > I have tried a lot of things but I simply get no response from Solr, not 
> > even an error.
> >
> > When calling the API: 
> > http://localhost:8983/solr/nh/dataimport?command=full-import
> > {
> >  "responseHeader":{
> >"status":0,
> >"QTime":38},
> >  "initArgs":[
> >"defaults",[
> >  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
> >  "command":"full-import",
> >  "status":"idle",
> >  "importResponse":"",
> >  "statusMessages":{}}
> >
> > The data looks like this:
> >
> > 
> >  
> > 2165432
> > 5
> >  
> >
> >  
> > 28548113
> > 89
> >  
> > 
> >
> >
> > The data-config file looks like this:
> >
> > 
> >  
> >
> >   >name="xml"
> >pk="id"
> >processor="XPathEntityProcessor"
> >stream="true"
> >forEach="/journal/doc"
> >url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> >transformer="RegexTransformer,TemplateTransformer"
> >>
> >
> >
> >
> >  
> >  
> > 
> >
> > And I referenced the jar files in the solr-config.xml as well as adding the 
> > request-handler by adding the following lines:
> >
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> > <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
> >
> >
> > <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> > I am running a core residing in the folder 
> > “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> > “C:/Users/z6mhq/Documents/solr-7.5.0”.
> >
> > I really hope that someone can spot my mistake…
> >
> > Thanks in advance.
> >
> > Martin Frank Hansen
> >
> >
>


Re: data-import-handler for solr-7.5.0

2018-10-02 Thread Jan Høydahl
> url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"

Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 2 Oct 2018, at 17:15, Martin Frank Hansen (MHQ) wrote:
> 
> Hi,
> 
> I am having some problems getting the data-import-handler in Solr to work. I 
> have tried a lot of things but I simply get no response from Solr, not even 
> an error.
> 
> When calling the API: 
> http://localhost:8983/solr/nh/dataimport?command=full-import
> {
>  "responseHeader":{
>"status":0,
>"QTime":38},
>  "initArgs":[
>"defaults",[
>  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
>  "command":"full-import",
>  "status":"idle",
>  "importResponse":"",
>  "statusMessages":{}}
> 
> The data looks like this:
> 
> 
>  
> 2165432
> 5
>  
> 
>  
> 28548113
> 89
>  
> 
> 
> 
> The data-config file looks like this:
> 
> 
>  
>
>  name="xml"
>pk="id"
>processor="XPathEntityProcessor"
>stream="true"
>forEach="/journal/doc"
>url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
>transformer="RegexTransformer,TemplateTransformer"
>> 
>
>
> 
>  
>  
> 
> 
> And I referenced the jar files in the solr-config.xml as well as adding the 
> request-handler by adding the following lines:
> 
> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-\d.*\.jar" />
> <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-extras-\d.*\.jar" />
> 
> 
> <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
>   </lst>
> </requestHandler>
> 
> I am running a core residing in the folder 
> “C:/Users/z6mhq/Desktop/nh/nh/conf” while the Solr installation is in 
> “C:/Users/z6mhq/Documents/solr-7.5.0”.
> 
> I really hope that someone can spot my mistake…
> 
> Thanks in advance.
> 
> Martin Frank Hansen
> 
> 



Re: Data Import Handler with Solr Source behind Load Balancer

2018-09-14 Thread Emir Arnautović
Hi Thomas,
Is this SolrCloud or Solr master-slave? Do you update index while indexing? Did 
you check if all your instances behind LB are in sync if you are using 
master-slave?
My guess would be that DIH is using cursors to read data from another Solr. If 
you are using multiple Solr instances behind LB there might be some diffs in 
index that results in different documents being returned for the same cursor 
mark. Is num doc and max doc the same on new instance after import?

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Sep 2018, at 05:53, Zimmermann, Thomas  
> wrote:
> 
> We have a Solr v7 Instance sourcing data from a Data Import Handler with a 
> Solr data source running Solr v4. When it hits a single server in that 
> instance directly, all documents are read and written correctly to the v7. 
> When we hit the load balancer DNS entry, the resulting data import handler 
> json states that it read all the documents and skipped none, and all looks 
> fine, but the result set is missing ~20% of the documents in the v7 core. 
> This has happened multiple time on multiple environments.
> 
> Any thoughts on whether this might be a bug in the underlying DIH code? I'll 
> also pass it along to the server admins on our side for input.



Re: Data Import Handler on 6.4.1

2017-03-15 Thread Walter Underwood
Also, upgrade to 6.4.2. There are serious performance problems in 6.4.0 and 
6.4.1.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 15, 2017, at 12:05 PM, Liu, Daphne  
> wrote:
> 
> For Solr 6.3,  I have to move mine to 
> ../solr-6.3.0/server/solr-webapp/webapp/WEB-INF/lib. If you are using jetty.
> 
> Kind regards,
> 
> Daphne Liu
> BI Architect - Matrix SCM
> 
> CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
> USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
> daphne@cevalogistics.com
> 
> 
> -Original Message-
> From: Michael Tobias [mailto:mtob...@btinternet.com]
> Sent: Wednesday, March 15, 2017 2:36 PM
> To: solr-user@lucene.apache.org
> Subject: Data Import Handler on 6.4.1
> 
> I am sure I am missing something simple but
> 
> I am running Solr 4.8.1 and trialling 6.4.1 on another computer.
> 
> I have had to manually modify the automatic 6.4.1 schema config as we use a 
> set of specialised field types.  They work fine.
> 
> I am now trying to populate my core with data and having problems.
> 
> Exactly what names/paths should I be using in the solrconfig.xml file to get 
> this working - I don’t recall doing ANYTHING for 4.8.1
> 
>   regex=".*\.jar" />  
>   regex="solr-dataimporthandler-.*\.jar" /> ?
> 
> And where do I put the mysql-connector-java-5.1.29-bin.jar file and how do I 
> reference it to get it loaded?
> 
>
> ??
> 
> And then later in the solrconfig.xml I have:
> 
> <requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">db-data-config.xml</str>
>   </lst>
> </requestHandler>
> 
> 
> Any help much appreciated.
> 
> Regards
> 
> Michael
> 
> 
> -Original Message-
> From: David Hastings [mailto:hastings.recurs...@gmail.com]
> Sent: 15 March 2017 17:47
> To: solr-user@lucene.apache.org
> Subject: Re: Get handler not working
> 
> from your previous email:
> "There is no "id"
> field defined in the schema."
> 
> you need an id field to use the get handler
> 
> On Wed, Mar 15, 2017 at 1:45 PM, Chris Ulicny  wrote:
> 
>> I thought that "id" and "ids" were fixed parameters for the get
>> handler, but I never remember, so I've already tried both. Each time
>> it comes back with the same response of no document.
>> 
>> On Wed, Mar 15, 2017 at 1:31 PM Alexandre Rafalovitch
>> 
>> wrote:
>> 
>>> Actually.
>>> 
>>> I think Real Time Get handler has "id" as a magical parameter, not
>>> as a field name. It maps to the real id field via the uniqueKey
>>> definition:
>>> https://cwiki.apache.org/confluence/display/solr/RealTime+Get
>>> 
>>> So, if you have not, could you try the way you originally wrote it.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>> experienced
>>> 
>>> 
>>> On 15 March 2017 at 13:22, Chris Ulicny  wrote:
 Sorry, that is a typo. The get is using the iqdocid field. There
 is no
>>> "id"
 field defined in the schema.
 
 solr/TestCollection/get?iqdocid=2957-TV-201604141900
 
 solr/TestCollection/select?q=*:*&fq=iqdocid:2957-TV-201604141900
 
 On Wed, Mar 15, 2017 at 1:15 PM Erick Erickson <
>> erickerick...@gmail.com>
 wrote:
 
> Is this a typo or are you trying to use get with an "id" field
> and your filter query uses "iqdocid"?
> 
> Best,
> Erick
> 
> On Wed, Mar 15, 2017 at 8:31 AM, Chris Ulicny 
>> wrote:
>> Yes, we're using a fixed schema with the iqdocid field set as
>> the
> uniqueKey.
>> 
>> On Wed, Mar 15, 2017 at 11:28 AM Alexandre Rafalovitch <
> arafa...@gmail.com>
>> wrote:
>> 
>>> What is your uniqueKey? Is it iqdocid?
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>>> 
>>> 
>>> On 15 March 2017 at 11:24, Chris Ulicny  wrote:
 Hi,
 
 I've been trying to use the get handler for a new solr cloud
> collection
>>> we
 are using, and something seems to be amiss.
 
 We are running 6.3.0, so we did not explicitly define the
 request
> handler
 in the solrconfig since it's supposed to be implicitly defined.
>> We
> also
 have the update log enabled with the default configuration.
 
 Whenever I send a get query for a document already known to
 be in
>>> the
 collection, I get no documents returned. But when I use a
 filter
> query on
 the uniqueKey field for the same value I get the document
 back
 
 solr/TestCollection/get?id=2957-TV-201604141900
 
 solr/TestCollection/select?q=*:*&fq=iqdocid:2957-TV-201604141900
 
 Is there some configuration 

RE: Data Import Handler on 6.4.1

2017-03-15 Thread Liu, Daphne
For Solr 6.3, I had to move mine to 
../solr-6.3.0/server/solr-webapp/webapp/WEB-INF/lib, if you are using Jetty.

Kind regards,

Daphne Liu
BI Architect - Matrix SCM

CEVA Logistics / 10751 Deerwood Park Blvd, Suite 200, Jacksonville, FL 32256 
USA / www.cevalogistics.com T 904.564.1192 / F 904.928.1448 / 
daphne@cevalogistics.com


-Original Message-
From: Michael Tobias [mailto:mtob...@btinternet.com]
Sent: Wednesday, March 15, 2017 2:36 PM
To: solr-user@lucene.apache.org
Subject: Data Import Handler on 6.4.1

I am sure I am missing something simple but

I am running Solr 4.8.1 and trialling 6.4.1 on another computer.

I have had to manually modify the automatic 6.4.1 schema config as we use a set 
of specialised field types.  They work fine.

I am now trying to populate my core with data and having problems.

Exactly what names/paths should I be using in the solrconfig.xml file to get 
this working - I don’t recall doing ANYTHING for 4.8.1


   ?

And where do I put the mysql-connector-java-5.1.29-bin.jar file and how do I 
reference it to get it loaded?


??

And then later in the solrconfig.xml I have:

<requestHandler name="..." class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>


Any help much appreciated.

Regards

Michael


-Original Message-
From: David Hastings [mailto:hastings.recurs...@gmail.com]
Sent: 15 March 2017 17:47
To: solr-user@lucene.apache.org
Subject: Re: Get handler not working

from your previous email:
"There is no "id"
field defined in the schema."

you need an id field to use the get handler

On Wed, Mar 15, 2017 at 1:45 PM, Chris Ulicny  wrote:

> I thought that "id" and "ids" were fixed parameters for the get
> handler, but I never remember, so I've already tried both. Each time
> it comes back with the same response of no document.
>
> On Wed, Mar 15, 2017 at 1:31 PM Alexandre Rafalovitch
> 
> wrote:
>
> > Actually.
> >
> > I think Real Time Get handler has "id" as a magical parameter, not
> > as a field name. It maps to the real id field via the uniqueKey
> > definition:
> > https://cwiki.apache.org/confluence/display/solr/RealTime+Get
> >
> > So, if you have not, could you try the way you originally wrote it.
> >
> > Regards,
> >Alex.
> > 
> > http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
> >
> >
> > On 15 March 2017 at 13:22, Chris Ulicny  wrote:
> > > Sorry, that is a typo. The get is using the iqdocid field. There
> > > is no
> > "id"
> > > field defined in the schema.
> > >
> > > solr/TestCollection/get?iqdocid=2957-TV-201604141900
> > >
> > > solr/TestCollection/select?q=*:*&fq=iqdocid:2957-TV-201604141900
> > >
> > > On Wed, Mar 15, 2017 at 1:15 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> Is this a typo or are you trying to use get with an "id" field
> > >> and your filter query uses "iqdocid"?
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> On Wed, Mar 15, 2017 at 8:31 AM, Chris Ulicny 
> wrote:
> > >> > Yes, we're using a fixed schema with the iqdocid field set as
> > >> > the
> > >> uniqueKey.
> > >> >
> > >> > On Wed, Mar 15, 2017 at 11:28 AM Alexandre Rafalovitch <
> > >> arafa...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> What is your uniqueKey? Is it iqdocid?
> > >> >>
> > >> >> Regards,
> > >> >>Alex.
> > >> >> 
> > >> >> http://www.solr-start.com/ - Resources for Solr users, new and
> > >> experienced
> > >> >>
> > >> >>
> > >> >> On 15 March 2017 at 11:24, Chris Ulicny  wrote:
> > >> >> > Hi,
> > >> >> >
> > >> >> > I've been trying to use the get handler for a new solr cloud
> > >> collection
> > >> >> we
> > >> >> > are using, and something seems to be amiss.
> > >> >> >
> > >> >> > We are running 6.3.0, so we did not explicitly define the
> > >> >> > request
> > >> handler
> > >> >> > in the solrconfig since it's supposed to be implicitly defined.
> We
> > >> also
> > >> >> > have the update log enabled with the default configuration.
> > >> >> >
> > >> >> > Whenever I send a get query for a document already known to
> > >> >> > be in
> > the
> > >> >> > collection, I get no documents returned. But when I use a
> > >> >> > filter
> > >> query on
> > >> >> > the uniqueKey field for the same value I get the document
> > >> >> > back
> > >> >> >
> > >> >> > solr/TestCollection/get?id=2957-TV-201604141900
> > >> >> >
> > >> >> > solr/TestCollection/select?q=*:*&fq=iqdocid:2957-TV-201604141900
> > >> >> >
> > >> >> > Is there some configuration that I am missing?
> > >> >> >
> > >> >> > Thanks,
> > >> >> > Chris
> > >> >>
> > >>
> >
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-05 Thread Damien Kamerman
You could configure the dataimporthandler to not delete at the start
(either do a delta import or set the preImportDeleteQuery), and set a
postImportDeleteQuery if required.
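A sketch of what that could look like on the root entity in data-config.xml
(the entity, columns, and delete queries are purely illustrative):

<document>
  <!-- preImportDeleteQuery replaces the default "delete everything" step of a clean
       full-import; postImportDeleteQuery runs after the import to remove stale docs. -->
  <entity name="item"
          query="SELECT id, name, price FROM items"
          preImportDeleteQuery="id:___none___"
          postImportDeleteQuery="stale_b:true">
    <field column="id" name="id"/>
    <field column="name" name="name_s"/>
    <field column="price" name="price_f"/>
  </entity>
</document>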

On Saturday, 4 March 2017, Alexandre Rafalovitch  wrote:

> Commit is index global. So if you have overlapping timelines and commit is
> issued, it will affect all changes done to that point.
>
> So, the aliases may be better for you. You could potentially also reload a
> core with changed solrconfig.xml settings, but that's heavy on caches.
>
> Regards,
>Alex
>
> On 3 Mar 2017 1:21 PM, "Sales"  >
> wrote:
>
>
> >
> > You have indicated that you have a way to avoid doing updates during the
> > full import.  Because of this, you do have another option that is likely
> > much easier for you to implement:  Set the "commitWithin" parameter on
> > each update request.  This works almost identically to autoSoftCommit,
> > but only after a request is made.  As long as there are never any of
> > these updates during a full import, these commits cannot affect that
> import.
>
> I had attempted at least to say that there may be a few updates that happen
> at the start of an import, so, they are while an import is happening just
> due to timing issues. Those will be detected, and, re-executed once the
> import is done though. But my question here is if the update is using
> commitWithin, then, does that only affect those updates that have the
> parameter, or, does it then also soft commit the in progress import? I
> cannot guarantee that zero updates will be done as there is a timing issue
> at the very start of the import, so, a few could cross over.
>
> Adding commitWithin is fine. Just want to make sure those that might
> execute for the first few seconds of an import don’t kill anything.
> >
> > No matter what is happening, you should have autoCommit (not
> > autoSoftCommit) configured with openSearcher set to false.  This will
> > ensure transaction log rollover, without affecting change visibility.  I
> > recommend a maxTime of one to five minutes for this.  You'll see 15
> > seconds as the recommended value in many places.
> >
> > https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-and-commit-in-sorlcloud/ <
> https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-
> and-commit-in-sorlcloud/>
>
> Oh, we are fine with much longer, does not have to be instant. 10-15
> minutes would be fine.
>
> >
> > Thanks
> > Shawn
> >
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
Commit is index global. So if you have overlapping timelines and commit is
issued, it will affect all changes done to that point.

So, the aliases may be better for you. You could potentially also reload a
core with changed solrconfig.xml settings, but that's heavy on caches.

Regards,
   Alex

On 3 Mar 2017 1:21 PM, "Sales" 
wrote:


>
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that
import.

I had attempted at least to say that there may be a few updates that happen
at the start of an import, so, they are while an import is happening just
due to timing issues. Those will be detected, and, re-executed once the
import is done though. But my question here is if the update is using
commitWithin, then, does that only affect those updates that have the
parameter, or, does it then also soft commit the in progress import? I
cannot guarantee that zero updates will be done as there is a timing issue
at the very start of the import, so, a few could cross over.

Adding commitWithin is fine. Just want to make sure those that might
execute for the first few seconds of an import don’t kill anything.
>
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
>
> https://lucidworks.com/2013/08/23/understanding-
transaction-logs-softcommit-and-commit-in-sorlcloud/ <
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-
and-commit-in-sorlcloud/>

Oh, we are fine with much longer, does not have to be instant. 10-15
minutes would be fine.

>
> Thanks
> Shawn
>


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> 
> You have indicated that you have a way to avoid doing updates during the
> full import.  Because of this, you do have another option that is likely
> much easier for you to implement:  Set the "commitWithin" parameter on
> each update request.  This works almost identically to autoSoftCommit,
> but only after a request is made.  As long as there are never any of
> these updates during a full import, these commits cannot affect that import.

I had attempted at least to say that there may be a few updates that happen at 
the start of an import; they occur while an import is happening just due to 
timing issues. Those will be detected and re-executed once the import is done, 
though. But my question here is: if the update uses commitWithin, does that 
only affect the updates that carry the parameter, or does it also soft commit 
the in-progress import? I cannot guarantee that zero updates will be done, 
since there is a timing issue at the very start of the import, so a few 
could cross over. 

Adding commitWithin is fine. I just want to make sure the ones that might execute 
during the first few seconds of an import don’t kill anything. 
> 
> No matter what is happening, you should have autoCommit (not
> autoSoftCommit) configured with openSearcher set to false.  This will
> ensure transaction log rollover, without affecting change visibility.  I
> recommend a maxTime of one to five minutes for this.  You'll see 15
> seconds as the recommended value in many places.
> 
> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>  
> 

Oh, we are fine with much longer, does not have to be instant. 10-15 minutes 
would be fine.

> 
> Thanks
> Shawn
> 



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Shawn Heisey
On 3/3/2017 10:17 AM, Sales wrote:
> I am not sure how best to handle this. We use the data import handle to 
> re-sync all our data on a daily basis, takes 1-2 hours depending on system 
> load. It is set up to commit at the end, so, the old index remains until it’s 
> done, and, we lose no access while the import is happening.
>
> But, we now want to update certain fields in the index, but still regen 
> daily. So, it would seem we might need to autocommit, and, soft commit 
> potentially. When we enabled those, during the index, the data disappeared 
> since it kept soft committing during the import process, I see no way to 
> avoid soft commits during the import. But soft commits would appear to be 
> needed for the (non import) updates to the index. 
>
> I realize the import could happen while an update is done, but we can 
> actually avoid those. So, that is not an issue (one or two might go through, 
> but, we will redo those updates once the index is done, that part is all 
> handled.

Erick's solution of using aliases to swap a live index and a build index
is one very good way to go.  It does involve some additional complexity
that you may not be ready for.  Only you will know whether that's
something you can implement easily.  Collection aliasing was implemented
in Solr 4.2 by SOLR-4497, so 4.10 should definitely have it.

You have indicated that you have a way to avoid doing updates during the
full import.  Because of this, you do have another option that is likely
much easier for you to implement:  Set the "commitWithin" parameter on
each update request.  This works almost identically to autoSoftCommit,
but only after a request is made.  As long as there are never any of
these updates during a full import, these commits cannot affect that import.

No matter what is happening, you should have autoCommit (not
autoSoftCommit) configured with openSearcher set to false.  This will
ensure transaction log rollover, without affecting change visibility.  I
recommend a maxTime of one to five minutes for this.  You'll see 15
seconds as the recommended value in many places.
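A sketch of that autoCommit setup in solrconfig.xml (the one-minute maxTime is
just one value from the range suggested above):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit for durability and transaction-log rollover only; never opens a searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- No autoSoftCommit here: visibility comes from commitWithin on each live update request -->
</updateHandler>

Each live update request would then carry its own commitWithin parameter (for
example commitWithin=60000) instead of relying on a global soft commit.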

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks
Shawn



Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales

> On Mar 3, 2017, at 11:30 AM, Erick Erickson  wrote:
> 
> One way to handle this (presuming SolrCloud) is collection aliasing.
> You create two collections, c1 and c2. You then have two aliases. when
> you start "index" is aliased to c1 and "search" is aliased to c2. Now
> do your full import  to "index" (and, BTW, you'd be well advised to do
> at least a hard commit openSearcher=false during that time or you risk
> replaying all the docs in the tlog).
> 
> When the full import is done, switch the aliases so "search" points to c1 and
> "index" points to c2. Rinse. Repeat. Your client apps always use the same 
> alias,
> the alias switching makes whether c1 or c2 is being used transparent.
> By that I mean your user-facing app uses "search" and your indexing client
> uses "index".
> 
> You can now do your live updates to the "search" alias that has a soft
> commit set.
> Of course you have to have some mechanism for replaying all the live updates
> that came in when you were doing your full index into the "indexing"
> alias before
> you switch, but you say you have that handled.
> 
> Best,
> Erick
> 

Thanks. So, is this available on 4.10.4? 

If not, we used to generate another core, do the import, and swap cores, so this is 
possibly similar to collection aliases since, in the end, the client did not 
care. I don’t see why that would not still work. It took a little effort to 
automate, but not much. 

Regarding the import and commit, we use readOnly in data-config.xml, so this 
sets autocommit the way I understand it. Not sure what happens with 
openSearcher though. If that is not sufficient, how would I do a hard commit with 
openSearcher=false during that time? Surely not by modifying the config file?

Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Erick Erickson
One way to handle this (presuming SolrCloud) is collection aliasing.
You create two collections, c1 and c2. You then have two aliases. when
you start "index" is aliased to c1 and "search" is aliased to c2. Now
do your full import  to "index" (and, BTW, you'd be well advised to do
at least a hard commit openSearcher=false during that time or you risk
replaying all the docs in the tlog).

When the full import is done, switch the aliases so "search" points to c1 and
"index" points to c2. Rinse. Repeat. Your client apps always use the same alias,
the alias switching makes whether c1 or c2 is being used transparent.
By that I mean your user-facing app uses "search" and your indexing client
uses "index".

You can now do your live updates to the "search" alias that has a soft
commit set.
Of course you have to have some mechanism for replaying all the live updates
that came in when you were doing your full index into the "indexing"
alias before
you switch, but you say you have that handled.

Best,
Erick

On Fri, Mar 3, 2017 at 9:22 AM, Alexandre Rafalovitch
 wrote:
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
>
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
>
> The main issue with "update" is that all fields (apart from pure
> copyField destinations) need to be stored, so the document can be
> reconstructed, updated, re-indexed. Perhaps you have something strange
> happening around that?
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Sales
> 
> On Mar 3, 2017, at 11:22 AM, Alexandre Rafalovitch  wrote:
> 
> On 3 March 2017 at 12:17, Sales  
> wrote:
>> When we enabled those, during the index, the data disappeared since it kept 
>> soft committing during the import process,
> 
> This part does not quite make sense. Could you expand on this "data
> disappeared" part to understand what the issue is.
> 

So, the issue here is that the first step of the import handler is to erase all the 
data, so there are no products left in the index (it would appear, based on what 
we see, after the first soft commit): a search returns no results at first, and 
then an ever-increasing number of records while the import is happening. We 
have 6 million indexed products.

Is there a way to stop soft commits during the import? I can't find one.

Re: Data Import Handler, also "Real Time" index updates

2017-03-03 Thread Alexandre Rafalovitch
On 3 March 2017 at 12:17, Sales  wrote:
> When we enabled those, during the index, the data disappeared since it kept 
> soft committing during the import process,

This part does not quite make sense. Could you expand on this "data
disappeared" part to understand what the issue is.

The main issue with "update" is that all fields (apart from pure
copyField destinations) need to be stored, so the document can be
reconstructed, updated, re-indexed. Perhaps you have something strange
happening around that?

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: Data Import Handler - maximum?

2016-12-12 Thread Shawn Heisey
On 12/11/2016 8:00 PM, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
>
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
>
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?
>
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?

There are no hard limits other than the Lucene limit of a little over
two billion docs per individual index.  With sharding, Solr is able to
easily overcome this limit on an entire index.

I have one index where each shard was over 50 million docs.  Each shard
has fewer docs now, because I changed it so there are more shards and
more machines.  For some reason the rebuild time (using DIH) got really
really long -- nearly 48 hours -- while building every shard in
parallel.  Still haven't figured out why the build time increased
dramatically.

One problem you might run into with DIH from a database has to do with
merging.  With default merge scheduler settings, eventually (typically
when there are millions of rows being imported) you'll run into a pause
in indexing that will take so long that the database connection will
close, causing the import to fail after the pause finishes.

I even opened a Lucene issue to get the default value for maxMergeCount
changed.  This issue went nowhere:

https://issues.apache.org/jira/browse/LUCENE-5705

Here's a thread from this mailing list discussing the problem and the
configuration solution:

http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html

Thanks,
Shawn
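
The setting discussed above lives in the indexConfig section of solrconfig.xml. A sketch with the values commonly suggested for spinning disks (not from this thread; tune to your own hardware):

    <indexConfig>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <int name="maxMergeCount">6</int>   <!-- more queued merges before indexing stalls -->
        <int name="maxThreadCount">1</int>  <!-- a single merge thread suits spinning disks -->
      </mergeScheduler>
    </indexConfig>

Raising maxMergeCount gives indexing more headroom before it pauses to wait for merges, which is what keeps long DIH runs from idling past the database connection timeout.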



Re: Data Import Handler - maximum?

2016-12-12 Thread Bernd Fehling

On 12.12.2016 at 04:00, Brian Narsi wrote:
> We are using Solr 5.1.0 and DIH to build index.
> 
> We are using DIH with clean=true and commit=true and optimize=true.
> Currently retrieving about 10.5 million records in about an hour.
> 
> I will like to find from other member's experiences as to how long can DIH
> run with no issues? What is the maximum number of records that anyone has
> pulled using DIH?

Afaik, DIH will run until the maximum number of documents per index is reached.
Our longest run took about 3.5 days for a single DIH and over 100 million docs.
The runtime depends pretty much on the complexity of the analysis during 
loading.

Currently we are using concurrent DIH with 12 processes, which takes 15 hours
for the same amount. Optimizing afterwards takes 9.5 hours.

SolrJ with 12 threads does the same indexing within 7.5 hours, plus 
optimizing.
For huge amounts of data you should consider using SolrJ.

> 
> Are there any limitations on the maximum number of records that can/should
> be pulled using DIH? What is the longest DIH can run?
> 
> Thanks a bunch!
> 


RE: Data import handler in techproducts example

2016-07-07 Thread Brooks Chuck (FCA)
Hello Jonas,

Did you figure this out? 

Dr. Chuck Brooks
248-838-5070


-Original Message-
From: Jonas Vasiliauskas [mailto:jonas.vasiliaus...@yahoo.com.INVALID] 
Sent: Saturday, July 02, 2016 11:37 AM
To: solr-user@lucene.apache.org
Subject: Data import handler in techproducts example

Hey,

I'm quite new to Solr and Java environments. My goal is to import some data from a 
MySQL database into the techproducts (core) example.

I have set up the data import handler (DIH) for techproducts based on the instructions 
here https://wiki.apache.org/solr/DIHQuickStart , but it looks like Solr doesn't 
load the DIH libraries. Could someone briefly explain how to check 
whether DIH is loaded and, if not, how I can load it?

Stacktrace is here: http://pastebin.ca/3654347

Thanks,


Re: Data import handler in techproducts example

2016-07-02 Thread Ahmet Arslan
Hi Jonas,

Search for the 
solr-dataimporthandler-*.jar and place it under a lib directory (same level as the 
solr.xml file) along with the MySQL JDBC driver (mysql-connector-java-*.jar).

Please see:
https://cwiki.apache.org/confluence/display/solr/Lib+Directives+in+SolrConfig
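
A sketch of the matching <lib> directives in solrconfig.xml; the dist path below is the one used by the stock example configs, and the driver path is just a placeholder:

    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
    <lib dir="/path/to/jdbc/drivers/" regex="mysql-connector-java-.*\.jar" />

Jars dropped into a lib directory next to solr.xml (i.e. $SOLR_HOME/lib) are picked up automatically and need no <lib> directive at all.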




On Saturday, July 2, 2016 9:56 PM, Jonas Vasiliauskas 
 wrote:
Hey,

I'm quite new to Solr and Java environments. My goal is to 
import some data from a MySQL database into the techproducts (core) example.

I have set up the data import handler (DIH) for techproducts based on 
the instructions here https://wiki.apache.org/solr/DIHQuickStart , but it looks 
like Solr doesn't load the DIH libraries. Could someone briefly explain 
how to check whether DIH is loaded and, if not, how I can load 
it?

Stacktrace is here: http://pastebin.ca/3654347

Thanks,


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Erick Erickson
There's nothing saying you have
to highlight fields you search on. So you
can specify hl.fl to be the "normal" (perhaps
stored-only) fields and still search on the
uber-field.

Best,
Erick
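
In other words, the request can search the catch-all field while highlighting only the stored fields, along these lines (the field names here are made up for illustration; hl.fl fields must be stored):

    http://localhost:8983/solr/collection1/select?q=text:memory&hl=true&hl.fl=title,description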

On Thu, May 26, 2016 at 2:08 PM, kostali hassan
 wrote:
> I did it , I copied all my dynamic field into text field and it work great.
> just one question even if I copied text into content and the inverse for
> get highliting , thats not work ,they are another way to get highliting?
> thank you eric
>
> 2016-05-26 18:28 GMT+01:00 Erick Erickson :
>
>> And, you can copy all of the fields into an "uber field" using the
>> copyField directive and just search the "uber field".
>>
>> Best,
>> Erick
>>
>> On Thu, May 26, 2016 at 7:35 AM, kostali hassan
>>  wrote:
>> > thank you it make sence .
>> > have a good day
>> >
>> > 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu > >:
>> >
>> >> The schema.xml/managed_schema defines the default search field as
>> `text`.
>> >>
>> >> You can make all fields that you want searchable type `text`.
>> >>
>> >> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
>> >> med.has.kost...@gmail.com>
>> >> wrote:
>> >>
>> >> > I import data from sql databases with DIH . I am looking for serch
>> term
>> >> in
>> >> > all fields not by field.
>> >> >
>> >>
>>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread kostali hassan
I did it: I copied all my dynamic fields into the text field and it works great.
Just one question: even though I copied text into content and the inverse to
get highlighting, that does not work. Is there another way to get highlighting?
Thank you, Erick

2016-05-26 18:28 GMT+01:00 Erick Erickson :

> And, you can copy all of the fields into an "uber field" using the
> copyField directive and just search the "uber field".
>
> Best,
> Erick
>
> On Thu, May 26, 2016 at 7:35 AM, kostali hassan
>  wrote:
> > thank you it make sence .
> > have a good day
> >
> > 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu  >:
> >
> >> The schema.xml/managed_schema defines the default search field as
> `text`.
> >>
> >> You can make all fields that you want searchable type `text`.
> >>
> >> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
> >> med.has.kost...@gmail.com>
> >> wrote:
> >>
> >> > I import data from sql databases with DIH . I am looking for serch
> term
> >> in
> >> > all fields not by field.
> >> >
> >>
>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Erick Erickson
And, you can copy all of the fields into an "uber field" using the
copyField directive and just search the "uber field".

Best,
Erick
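
A minimal schema sketch of that approach (the field and type names are the usual example ones, not anything from this thread):

    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="*" dest="text"/>

The destination can stay stored="false": searches run against it, while display and highlighting use the original stored fields. It must be multiValued because many source fields copy into it.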

On Thu, May 26, 2016 at 7:35 AM, kostali hassan
 wrote:
> thank you it make sence .
> have a good day
>
> 2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu :
>
>> The schema.xml/managed_schema defines the default search field as `text`.
>>
>> You can make all fields that you want searchable type `text`.
>>
>> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
>> med.has.kost...@gmail.com>
>> wrote:
>>
>> > I import data from sql databases with DIH . I am looking for serch term
>> in
>> > all fields not by field.
>> >
>>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread kostali hassan
Thank you, it makes sense.
Have a good day.

2016-05-26 15:31 GMT+01:00 Siddhartha Singh Sandhu :

> The schema.xml/managed_schema defines the default search field as `text`.
>
> You can make all fields that you want searchable type `text`.
>
> On Thu, May 26, 2016 at 10:23 AM, kostali hassan <
> med.has.kost...@gmail.com>
> wrote:
>
> > I import data from sql databases with DIH . I am looking for serch term
> in
> > all fields not by field.
> >
>


Re: "data import handler : import data from sql database :how to search in all fields"

2016-05-26 Thread Siddhartha Singh Sandhu
The schema.xml/managed_schema defines the default search field as `text`.

You can make all fields that you want searchable type `text`.
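
In recent Solr versions the default search field is usually set with the df request parameter rather than in the schema. A sketch of the solrconfig.xml form, assuming a catch-all field named text exists:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="df">text</str>   <!-- queries with no explicit field search here -->
      </lst>
    </requestHandler>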

On Thu, May 26, 2016 at 10:23 AM, kostali hassan 
wrote:

> I import data from sql databases with DIH . I am looking for serch term in
> all fields not by field.
>


Re: Data Import Handler - Multivalued fields - splitBy

2016-02-27 Thread saravanan1980
It's resolved after changing my column name... it's all case-sensitive.







Re: Data Import Handler - Multivalued fields - splitBy

2016-02-27 Thread saravanan1980
I am also having the same problem.

Have you resolved this issue?

"response": {
"numFound": 3,
"start": 0,
"docs": [
  {
"genre": [
  "Action|Adventure",
  "Action",
  "Adventure"
]
  },
  {
"genre": [
  "Drama|Suspense",
  "Drama",
  "Suspense"
]
  },
  {
"genre": [
  "Adventure|Family|Fantasy|Science Fiction",
  "Adventure",
  "Family",
  "Fantasy",
  "Science Fiction"
]
  }
]
  }

Please let me know, if it is resolved...

 










Re: Data Import Handler Usage

2016-02-16 Thread vidya
Hi

The Dataimport section in the web UI still shows that no data import handler
is defined, and no data is being added to my new collection.





Re: Data Import Handler Usage

2016-02-16 Thread Erik Hatcher
The "other" collection (destination of the import) is the collection where that 
data import handler definition resides. 

   Erik
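
For a Solr-to-Solr import like the one below, both pieces typically live in the destination collection's config. A sketch (not the poster's exact files): the handler registration in solrconfig.xml and a data-config.xml using SolrEntityProcessor pointed at the source collection:

    <requestHandler name="/dataimport"
                    class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>  <!-- resolved against the collection's conf/ dir -->
      </lst>
    </requestHandler>

    <!-- conf/data-config.xml -->
    <dataConfig>
      <document>
        <entity name="fromCollection1" processor="SolrEntityProcessor"
                url="http://localhost:8983/solr/collection1" query="*:*"/>
      </document>
    </dataConfig>

If the handler registration is missing, or points at a config file the core cannot read, the Dataimport screen in the admin UI reports that no handler is defined.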

> On Feb 16, 2016, at 01:54, vidya  wrote:
> 
> Hi
> 
> I have gone through documents to define data import handler in solr. But i
> couldnot implement it.
> I have created data-config.xml file that specifies moving data from
> collection1 core to another collection, i donno where i need to specify that
> second collection.
> 
> 
>  
> url="http://localhost:8983/solr/collection1; query="*:*"/>
>  
> 
> 
> and request handler is defined as follows in solrconfig.xml
> 
>  class="org.apache.solr.handler.dataimport.DataImportHandler">
>
>  /home/username/data-config.xml
>
>  
> 
> Even after adding this, i couldnot get any data import handler in web url
> page for importing.
> Why is it so? And what changes need to be done?
> I have followed the following url : 
> http://www.codewrecks.com/blog/index.php/2013/4/29/loading-data-from-sql-server-to-solr-with-a-data-import-handler
> 
> 
> 
> 


Re: Data Import Handler - autoSoftCommit and autoCommit

2016-02-08 Thread Rajesh Hazari
We have this for a collection which is updated every 3 minutes with a minimum of 500
documents, plus a load of about 10k documents once at the start of the day:


   ${solr.autoCommit.maxTime:30}
1
true
true
 
   
  ${solr.autoSoftCommit.maxTime:6000}
   

As per the Solr documentation, if you have a Solr client indexing documents,
it is not suggested to use commit=true and optimize=true explicitly.

We have not tested the data import handler with 10 million records.

We settled on this config after many tests and after understanding
the needs and requirements.


*Rajesh**.*

On Mon, Feb 8, 2016 at 10:15 AM, Troy Edwards 
wrote:

> We are running the data import handler to retrieve about 10 million records
> during work hours every day of the week. We are using Clean = true, Commit
> = true and Optimize = true. The entire process takes about 1 hour.
>
> What would be a good setting for autoCommit and autoSoftCommit?
>
> Thanks
>


Re: Data Import Handler - autoSoftCommit and autoCommit

2016-02-08 Thread Susheel Kumar
You can start with one of the suggestions from this link based on your
indexing and query load.


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


Thanks,
Susheel

On Mon, Feb 8, 2016 at 10:15 AM, Troy Edwards 
wrote:

> We are running the data import handler to retrieve about 10 million records
> during work hours every day of the week. We are using Clean = true, Commit
> = true and Optimize = true. The entire process takes about 1 hour.
>
> What would be a good setting for autoCommit and autoSoftCommit?
>
> Thanks
>


Re: Data Import Handler takes different time on different machines

2016-02-03 Thread Troy Edwards
While researching the space on the servers, I found that log files from
Sept 2015 are still there. These are solr_gc_log_datetime and
solr_log_datetime.

Is the default logging for Solr ok for production systems or does it need
to be changed/tuned?

Thanks,
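
For reference, the shipped log4j.properties rotates solr.log by size, but the solr_log_* and solr_gc_log_* files are produced by the start script on restart and are not cleaned up, so some housekeeping is usually added. A sketch (appender name as in the stock 5.x log4j.properties; paths and retention are only examples):

    # server/resources/log4j.properties
    log4j.appender.file.MaxFileSize=50MB
    log4j.appender.file.MaxBackupIndex=9

    # cron job to prune archived startup/GC logs older than 30 days
    find /var/solr/logs -name "solr_gc_log_*" -mtime +30 -delete
    find /var/solr/logs -name "solr_log_*"    -mtime +30 -delete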

On Tue, Feb 2, 2016 at 2:04 PM, Troy Edwards 
wrote:

> That is help!
>
> Thank you for the thoughts.
>
>
> On Tue, Feb 2, 2016 at 12:17 PM, Erick Erickson 
> wrote:
>
>> Scratch that installation and start over?
>>
>> Really, it sounds like something is fundamentally messed up with the
>> Linux install. Perhaps something as simple as file paths, or you have
>> old jars hanging around that are mis-matched. Or someone manually
>> deleted files from the Solr install. Or your disk filled up. Or
>>
>> How sure are you that the linux setup was done properly?
>>
>> Not much help I know,
>> Erick
>>
>> On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards 
>> wrote:
>> > Rerunning the Data Import Handler again on the the linux machine has
>> > started producing some errors and warnings:
>> >
>> > On the node on which DIH was started:
>> >
>> > WARN SolrWriter Error creating document : SolrInputDocument
>> >
>> > org.apache.solr.common.SolrException: No registered leader was found
>> > after waiting for 4000ms , collection: collectionmain slice: shard1
>> >
>> >
>> >
>> > On the second node:
>> >
>> > WARN ReplicationHandler Exception while writing response for params:
>> >
>> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
>> >
>> > java.nio.file.NoSuchFileException:
>> >
>> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
>> >
>> >
>> > ERROR
>> >
>> > Index fetch failed :org.apache.solr.common.SolrException: Unable to
>> > download _169.si completely. Downloaded 0!=466
>> >
>> >
>> > ReplicationHandler Index fetch failed
>> > :org.apache.solr.common.SolrException: Unable to download _169.si
>> > completely. Downloaded 0!=466
>> >
>> > WARN
>> > IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum
>> is
>> > 3549855722 and actual is checksum 2062372352. expected length is 72522
>> and
>> > actual length is 39227
>> >
>> > WARN UpdateLog Log replay finished.
>> recoveryInfo=RecoveryInfo{adds=840638
>> > deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
>> >
>> >
>> > Any suggestions about this?
>> >
>> > Thanks
>> >
>> > On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> The first thing I'd be looking at is how I the JDBC batch size compares
>> >> between the two machines.
>> >>
>> >> AFAIK, Solr shouldn't notice the difference, and since a large majority
>> >> of the development is done on Linux-based systems, I'd be surprised if
>> >> this was worse than Windows, which would lead me to the one thing that
>> >> is definitely different between the two: Your JDBC driver and its
>> settings.
>> >> At least that's where I'd look first.
>> >>
>> >> If nothing immediate pops up, I'd probably write a small driver
>> program to
>> >> just access the database from the two machines and process your 10M
>> >> records _without_ sending them to Solr and see what the comparison is.
>> >>
>> >> You can also forgo DIH and do a simple import program via SolrJ. The
>> >> advantage here is that the comparison I'm talking about above is
>> >> really simple, just comment out the call that sends data to Solr.
>> Here's an
>> >> example...
>> >>
>> >> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards > >
>> >> wrote:
>> >> > Sorry, I should explain further. The Data Import Handler had been
>> running
>> >> > for a while retrieving only about 15 records from the database.
>> Both
>> >> in
>> >> > development env (windows) and linux machine it took about 3 mins.
>> >> >
>> >> > The query has been changed and we are now trying to retrieve about 10
>> >> > million records. We do expect the time to increase.
>> >> >
>> >> > With the new query the time taken on windows machine is consistently
>> >> around
>> >> > 40 mins. While the DIH is running queries slow down i.e. a query that
>> >> > typically took 60 msec takes 100 msec.
>> >> >
>> >> > The time taken on linux machine is consistently around 2.5 hours.
>> While
>> >> the
>> >> > DIH is running queries take about 200  to 400 msec.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> What happens if you run just the SQL query from the
>> >> >> windows box and from the linux box? Is there any chance
>> >> >> that somehow the connection from the linux box is
>> >> >> just slower?
>> >> >>
>> >> >> Best,
>> >> >> Erick
>> >> >>
>> >> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>> >> >> 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Erick Erickson
Scratch that installation and start over?

Really, it sounds like something is fundamentally messed up with the
Linux install. Perhaps something as simple as file paths, or you have
old jars hanging around that are mis-matched. Or someone manually
deleted files from the Solr install. Or your disk filled up. Or...

How sure are you that the linux setup was done properly?

Not much help I know,
Erick

On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards  wrote:
> Rerunning the Data Import Handler again on the the linux machine has
> started producing some errors and warnings:
>
> On the node on which DIH was started:
>
> WARN SolrWriter Error creating document : SolrInputDocument
>
> org.apache.solr.common.SolrException: No registered leader was found
> after waiting for 4000ms , collection: collectionmain slice: shard1
>
>
>
> On the second node:
>
> WARN ReplicationHandler Exception while writing response for params:
> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
>
> java.nio.file.NoSuchFileException:
> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
>
>
> ERROR
>
> Index fetch failed :org.apache.solr.common.SolrException: Unable to
> download _169.si completely. Downloaded 0!=466
>
>
> ReplicationHandler Index fetch failed
> :org.apache.solr.common.SolrException: Unable to download _169.si
> completely. Downloaded 0!=466
>
> WARN
> IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
> 3549855722 and actual is checksum 2062372352. expected length is 72522 and
> actual length is 39227
>
> WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
> deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
>
>
> Any suggestions about this?
>
> Thanks
>
> On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson 
> wrote:
>
>> The first thing I'd be looking at is how I the JDBC batch size compares
>> between the two machines.
>>
>> AFAIK, Solr shouldn't notice the difference, and since a large majority
>> of the development is done on Linux-based systems, I'd be surprised if
>> this was worse than Windows, which would lead me to the one thing that
>> is definitely different between the two: Your JDBC driver and its settings.
>> At least that's where I'd look first.
>>
>> If nothing immediate pops up, I'd probably write a small driver program to
>> just access the database from the two machines and process your 10M
>> records _without_ sending them to Solr and see what the comparison is.
>>
>> You can also forgo DIH and do a simple import program via SolrJ. The
>> advantage here is that the comparison I'm talking about above is
>> really simple, just comment out the call that sends data to Solr. Here's an
>> example...
>>
>> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
>> wrote:
>> > Sorry, I should explain further. The Data Import Handler had been running
>> > for a while retrieving only about 15 records from the database. Both
>> in
>> > development env (windows) and linux machine it took about 3 mins.
>> >
>> > The query has been changed and we are now trying to retrieve about 10
>> > million records. We do expect the time to increase.
>> >
>> > With the new query the time taken on windows machine is consistently
>> around
>> > 40 mins. While the DIH is running queries slow down i.e. a query that
>> > typically took 60 msec takes 100 msec.
>> >
>> > The time taken on linux machine is consistently around 2.5 hours. While
>> the
>> > DIH is running queries take about 200  to 400 msec.
>> >
>> > Thanks!
>> >
>> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
>> > wrote:
>> >
>> >> What happens if you run just the SQL query from the
>> >> windows box and from the linux box? Is there any chance
>> >> that somehow the connection from the linux box is
>> >> just slower?
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>> >>  wrote:
>> >> > What are you importing from? Is the source and Solr machine collocated
>> >> > in the same fashion on dev and prod?
>> >> >
>> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
>> >> > machine is loaded much more than a dev.
>> >> >
>> >> > Regards,
>> >> >Alex.
>> >> > 
>> >> > Newsletter and resources for Solr beginners and intermediates:
>> >> > http://www.solr-start.com/
>> >> >
>> >> >
>> >> > On 2 February 2016 at 13:21, Troy Edwards 
>> >> wrote:
>> >> >> We have a windows development machine on which the Data Import
>> Handler
>> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM
>> >> memory is
>> >> >> 2 GB per node.
>> >> >>
>> >> >> But on a linux machine it consistently takes about 2.5 hours. The
>> >> queries
>> >> >> also run slower. JVM memory 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Troy Edwards
That is help!

Thank you for the thoughts.


On Tue, Feb 2, 2016 at 12:17 PM, Erick Erickson 
wrote:

> Scratch that installation and start over?
>
> Really, it sounds like something is fundamentally messed up with the
> Linux install. Perhaps something as simple as file paths, or you have
> old jars hanging around that are mis-matched. Or someone manually
> deleted files from the Solr install. Or your disk filled up. Or
>
> How sure are you that the linux setup was done properly?
>
> Not much help I know,
> Erick
>
> On Tue, Feb 2, 2016 at 10:11 AM, Troy Edwards 
> wrote:
> > Rerunning the Data Import Handler again on the the linux machine has
> > started producing some errors and warnings:
> >
> > On the node on which DIH was started:
> >
> > WARN SolrWriter Error creating document : SolrInputDocument
> >
> > org.apache.solr.common.SolrException: No registered leader was found
> > after waiting for 4000ms , collection: collectionmain slice: shard1
> >
> >
> >
> > On the second node:
> >
> > WARN ReplicationHandler Exception while writing response for params:
> >
> command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip
> >
> > java.nio.file.NoSuchFileException:
> >
> /var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip
> >
> >
> > ERROR
> >
> > Index fetch failed :org.apache.solr.common.SolrException: Unable to
> > download _169.si completely. Downloaded 0!=466
> >
> >
> > ReplicationHandler Index fetch failed
> > :org.apache.solr.common.SolrException: Unable to download _169.si
> > completely. Downloaded 0!=466
> >
> > WARN
> > IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
> > 3549855722 and actual is checksum 2062372352. expected length is 72522
> and
> > actual length is 39227
> >
> > WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
> > deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}
> >
> >
> > Any suggestions about this?
> >
> > Thanks
> >
> > On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson  >
> > wrote:
> >
> >> The first thing I'd be looking at is how I the JDBC batch size compares
> >> between the two machines.
> >>
> >> AFAIK, Solr shouldn't notice the difference, and since a large majority
> >> of the development is done on Linux-based systems, I'd be surprised if
> >> this was worse than Windows, which would lead me to the one thing that
> >> is definitely different between the two: Your JDBC driver and its
> settings.
> >> At least that's where I'd look first.
> >>
> >> If nothing immediate pops up, I'd probably write a small driver program
> to
> >> just access the database from the two machines and process your 10M
> >> records _without_ sending them to Solr and see what the comparison is.
> >>
> >> You can also forgo DIH and do a simple import program via SolrJ. The
> >> advantage here is that the comparison I'm talking about above is
> >> really simple, just comment out the call that sends data to Solr.
> Here's an
> >> example...
> >>
> >> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
> >> wrote:
> >> > Sorry, I should explain further. The Data Import Handler had been
> running
> >> > for a while retrieving only about 15 records from the database.
> Both
> >> in
> >> > development env (windows) and linux machine it took about 3 mins.
> >> >
> >> > The query has been changed and we are now trying to retrieve about 10
> >> > million records. We do expect the time to increase.
> >> >
> >> > With the new query the time taken on windows machine is consistently
> >> around
> >> > 40 mins. While the DIH is running queries slow down i.e. a query that
> >> > typically took 60 msec takes 100 msec.
> >> >
> >> > The time taken on linux machine is consistently around 2.5 hours.
> While
> >> the
> >> > DIH is running queries take about 200  to 400 msec.
> >> >
> >> > Thanks!
> >> >
> >> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> What happens if you run just the SQL query from the
> >> >> windows box and from the linux box? Is there any chance
> >> >> that somehow the connection from the linux box is
> >> >> just slower?
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
> >> >>  wrote:
> >> >> > What are you importing from? Is the source and Solr machine
> collocated
> >> >> > in the same fashion on dev and prod?
> >> >> >
> >> >> > Have you tried running this on a Linux dev machine? Perhaps your
> prod
> >> >> > machine is loaded much more than a dev.
> >> >> >
> >> >> > Regards,
> >> >> >Alex.
> >> >> > 
> >> >> > Newsletter and resources for Solr beginners and intermediates:
> >> >> > http://www.solr-start.com/
> >> >> >
> >> >> >
> >> >> > 

Re: Data Import Handler takes different time on different machines

2016-02-02 Thread Troy Edwards
Rerunning the Data Import Handler on the linux machine has
started producing some errors and warnings:

On the node on which DIH was started:

WARN SolrWriter Error creating document : SolrInputDocument

org.apache.solr.common.SolrException: No registered leader was found
after waiting for 4000ms , collection: collectionmain slice: shard1



On the second node:

WARN ReplicationHandler Exception while writing response for params:
command=filecontent=true=1047=/replication=filestream=_1oo_Lucene50_0.tip

java.nio.file.NoSuchFileException:
/var/solr/data/collectionmain_shard2_replica1/data/index/_1oo_Lucene50_0.tip


ERROR

Index fetch failed :org.apache.solr.common.SolrException: Unable to
download _169.si completely. Downloaded 0!=466


ReplicationHandler Index fetch failed
:org.apache.solr.common.SolrException: Unable to download _169.si
completely. Downloaded 0!=466

WARN
IndexFetcher File _1pd_Lucene50_0.tim did not match. expected checksum is
3549855722 and actual is checksum 2062372352. expected length is 72522 and
actual length is 39227

WARN UpdateLog Log replay finished. recoveryInfo=RecoveryInfo{adds=840638
deletes=0 deleteByQuery=0 errors=0 positionOfStart=554264}


Any suggestions about this?

Thanks

On Mon, Feb 1, 2016 at 10:03 PM, Erick Erickson 
wrote:

> The first thing I'd be looking at is how I the JDBC batch size compares
> between the two machines.
>
> AFAIK, Solr shouldn't notice the difference, and since a large majority
> of the development is done on Linux-based systems, I'd be surprised if
> this was worse than Windows, which would lead me to the one thing that
> is definitely different between the two: Your JDBC driver and its settings.
> At least that's where I'd look first.
>
> If nothing immediate pops up, I'd probably write a small driver program to
> just access the database from the two machines and process your 10M
> records _without_ sending them to Solr and see what the comparison is.
>
> You can also forgo DIH and do a simple import program via SolrJ. The
> advantage here is that the comparison I'm talking about above is
> really simple, just comment out the call that sends data to Solr. Here's an
> example...
>
> https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards 
> wrote:
> > Sorry, I should explain further. The Data Import Handler had been running
> > for a while retrieving only about 15 records from the database. Both
> in
> > development env (windows) and linux machine it took about 3 mins.
> >
> > The query has been changed and we are now trying to retrieve about 10
> > million records. We do expect the time to increase.
> >
> > With the new query the time taken on windows machine is consistently
> around
> > 40 mins. While the DIH is running queries slow down i.e. a query that
> > typically took 60 msec takes 100 msec.
> >
> > The time taken on linux machine is consistently around 2.5 hours. While
> the
> > DIH is running queries take about 200  to 400 msec.
> >
> > Thanks!
> >
> > On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
> > wrote:
> >
> >> What happens if you run just the SQL query from the
> >> windows box and from the linux box? Is there any chance
> >> that somehow the connection from the linux box is
> >> just slower?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
> >>  wrote:
> >> > What are you importing from? Is the source and Solr machine collocated
> >> > in the same fashion on dev and prod?
> >> >
> >> > Have you tried running this on a Linux dev machine? Perhaps your prod
> >> > machine is loaded much more than a dev.
> >> >
> >> > Regards,
> >> >Alex.
> >> > 
> >> > Newsletter and resources for Solr beginners and intermediates:
> >> > http://www.solr-start.com/
> >> >
> >> >
> >> > On 2 February 2016 at 13:21, Troy Edwards 
> >> wrote:
> >> >> We have a windows development machine on which the Data Import
> Handler
> >> >> consistently takes about 40 mins to finish. Queries run fine. JVM
> >> memory is
> >> >> 2 GB per node.
> >> >>
> >> >> But on a linux machine it consistently takes about 2.5 hours. The
> >> queries
> >> >> also run slower. JVM memory here is also 2 GB per node.
> >> >>
> >> >> How should I go about analyzing and tuning the linux machine?
> >> >>
> >> >> Thanks
> >>
>


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Erick Erickson
What happens if you run just the SQL query from the
windows box and from the linux box? Is there any chance
that somehow the connection from the linux box is
just slower?

Best,
Erick

On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
 wrote:
> What are you importing from? Is the source and Solr machine collocated
> in the same fashion on dev and prod?
>
> Have you tried running this on a Linux dev machine? Perhaps your prod
> machine is loaded much more than a dev.
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 February 2016 at 13:21, Troy Edwards  wrote:
>> We have a windows development machine on which the Data Import Handler
>> consistently takes about 40 mins to finish. Queries run fine. JVM memory is
>> 2 GB per node.
>>
>> But on a linux machine it consistently takes about 2.5 hours. The queries
>> also run slower. JVM memory here is also 2 GB per node.
>>
>> How should I go about analyzing and tuning the linux machine?
>>
>> Thanks


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Alexandre Rafalovitch
What are you importing from? Is the source and Solr machine collocated
in the same fashion on dev and prod?

Have you tried running this on a Linux dev machine? Perhaps your prod
machine is loaded much more than a dev.

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 2 February 2016 at 13:21, Troy Edwards  wrote:
> We have a windows development machine on which the Data Import Handler
> consistently takes about 40 mins to finish. Queries run fine. JVM memory is
> 2 GB per node.
>
> But on a linux machine it consistently takes about 2.5 hours. The queries
> also run slower. JVM memory here is also 2 GB per node.
>
> How should I go about analyzing and tuning the linux machine?
>
> Thanks


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Erick Erickson
The first thing I'd be looking at is how the JDBC batch size compares
between the two machines.

AFAIK, Solr shouldn't notice the difference, and since a large majority
of the development is done on Linux-based systems, I'd be surprised if
this was worse than Windows, which would lead me to the one thing that
is definitely different between the two: Your JDBC driver and its settings.
At least that's where I'd look first.

If nothing immediate pops up, I'd probably write a small driver program to
just access the database from the two machines and process your 10M
records _without_ sending them to Solr and see what the comparison is.

You can also forgo DIH and do a simple import program via SolrJ. The
advantage here is that the comparison I'm talking about above is
really simple, just comment out the call that sends data to Solr. Here's an
example...

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick
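
For the batch-size comparison, the relevant knob in DIH is the batchSize attribute on the dataSource (the connection details below are placeholders). For MySQL, batchSize="-1" makes the driver stream rows instead of buffering the whole result set:

    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://dbhost:3306/mydb" user="user" password="pass"
                batchSize="-1"/>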

On Mon, Feb 1, 2016 at 7:34 PM, Troy Edwards  wrote:
> Sorry, I should explain further. The Data Import Handler had been running
> for a while retrieving only about 15 records from the database. Both in
> development env (windows) and linux machine it took about 3 mins.
>
> The query has been changed and we are now trying to retrieve about 10
> million records. We do expect the time to increase.
>
> With the new query the time taken on windows machine is consistently around
> 40 mins. While the DIH is running queries slow down i.e. a query that
> typically took 60 msec takes 100 msec.
>
> The time taken on linux machine is consistently around 2.5 hours. While the
> DIH is running queries take about 200  to 400 msec.
>
> Thanks!
>
> On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
> wrote:
>
>> What happens if you run just the SQL query from the
>> windows box and from the linux box? Is there any chance
>> that somehow the connection from the linux box is
>> just slower?
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>>  wrote:
>> > What are you importing from? Is the source and Solr machine collocated
>> > in the same fashion on dev and prod?
>> >
>> > Have you tried running this on a Linux dev machine? Perhaps your prod
>> > machine is loaded much more than a dev.
>> >
>> > Regards,
>> >Alex.
>> > 
>> > Newsletter and resources for Solr beginners and intermediates:
>> > http://www.solr-start.com/
>> >
>> >
>> > On 2 February 2016 at 13:21, Troy Edwards 
>> wrote:
>> >> We have a windows development machine on which the Data Import Handler
>> >> consistently takes about 40 mins to finish. Queries run fine. JVM
>> memory is
>> >> 2 GB per node.
>> >>
>> >> But on a linux machine it consistently takes about 2.5 hours. The
>> queries
>> >> also run slower. JVM memory here is also 2 GB per node.
>> >>
>> >> How should I go about analyzing and tuning the linux machine?
>> >>
>> >> Thanks
>>


Re: Data Import Handler takes different time on different machines

2016-02-01 Thread Troy Edwards
Sorry, I should explain further. The Data Import Handler had been running
for a while retrieving only about 15 records from the database. Both in the
development env (Windows) and on the linux machine it took about 3 mins.

The query has been changed and we are now trying to retrieve about 10
million records. We do expect the time to increase.

With the new query, the time taken on the Windows machine is consistently around
40 mins. While the DIH is running, queries slow down, i.e. a query that
typically took 60 msec takes 100 msec.

The time taken on the linux machine is consistently around 2.5 hours. While the
DIH is running, queries take about 200 to 400 msec.

Thanks!

On Mon, Feb 1, 2016 at 8:45 PM, Erick Erickson 
wrote:

> What happens if you run just the SQL query from the
> windows box and from the linux box? Is there any chance
> that somehow the connection from the linux box is
> just slower?
>
> Best,
> Erick
>
> On Mon, Feb 1, 2016 at 6:36 PM, Alexandre Rafalovitch
>  wrote:
> > What are you importing from? Is the source and Solr machine collocated
> > in the same fashion on dev and prod?
> >
> > Have you tried running this on a Linux dev machine? Perhaps your prod
> > machine is loaded much more than a dev.
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> > On 2 February 2016 at 13:21, Troy Edwards 
> wrote:
> >> We have a windows development machine on which the Data Import Handler
> >> consistently takes about 40 mins to finish. Queries run fine. JVM
> memory is
> >> 2 GB per node.
> >>
> >> But on a linux machine it consistently takes about 2.5 hours. The
> queries
> >> also run slower. JVM memory here is also 2 GB per node.
> >>
> >> How should I go about analyzing and tuning the linux machine?
> >>
> >> Thanks
>


RE: Data Import Handler - Multivalued fields - splitBy

2015-12-04 Thread Dyer, James
Brian,

Be sure to have...

transformer="RegexTransformer"

...in your entity tag.  It’s the RegexTransformer class that looks for 
"splitBy".

See https://wiki.apache.org/solr/DataImportHandler#RegexTransformer for more 
information.

James Dyer
Ingram Content Group
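
Put together, a sketch of the relevant part of data-config.xml (table and column names are placeholders; note that the pipe has to be escaped in the regex):

    <entity name="item" transformer="RegexTransformer"
            query="SELECT id, genre FROM items">
      <field column="genre" splitBy="\|"/>
    </entity>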


-Original Message-
From: Brian Narsi [mailto:bnars...@gmail.com] 
Sent: Friday, December 04, 2015 3:10 PM
To: solr-user@lucene.apache.org
Subject: Data Import Handler - Multivalued fields - splitBy

I have the following:





I believe I had the following working (splitting on pipe delimited)



But it does not work now.



In fact, now I have even tried



But I cannot get the values to split into an array.

Any thoughts/suggestions what may be wrong?

Thanks,


Re: Data Import Handler - Multivalued fields - splitBy

2015-12-04 Thread Brian Narsi
That was it! Thank you!

On Fri, Dec 4, 2015 at 3:13 PM, Dyer, James 
wrote:

> Brian,
>
> Be sure to have...
>
> transformer="RegexTransformer"
>
> ...in your  tag.  It’s the RegexTransformer class that looks
> for "splitBy".
>
> See https://wiki.apache.org/solr/DataImportHandler#RegexTransformer for
> more information.
>
> James Dyer
> Ingram Content Group
>
>
> -Original Message-
> From: Brian Narsi [mailto:bnars...@gmail.com]
> Sent: Friday, December 04, 2015 3:10 PM
> To: solr-user@lucene.apache.org
> Subject: Data Import Handler - Multivalued fields - splitBy
>
> I have the following:
>
>  required="true" multiValued="true" />
>
>
>
> I believe I had the following working (splitting on pipe delimited)
>
> 
>
> But it does not work now.
>
>
>
> In-fact now I have even tried
>
> 
>
> But I cannot get the values to split into an array.
>
> Any thoughts/suggestions what may be wrong?
>
> Thanks,
>


Re: Data Import Handler / Backup indexes

2015-11-23 Thread Jeff Wartes

The backup/restore approach in SOLR-5750 and in solrcloud_manager is
really just that - copying the index files.
On backup, it saves your index directories, and on restore, it puts them
in the data dir, moves a pointer for the current index dir, and opens a
new searcher. Both are mostly just wrappers on the proper Solr
replication-handler commands, since Solr already has some lower level APIs
for these operations.

There is a shared filesystem requirement for backup/restore though, which
is to account for the fact that when you make the backup you don’t know
which nodes will need to restore a given shard.

The commands would look something like:

java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c collection1 --dir 

Or you could restore into a new collection:
java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c newcollection --dir 
--restoreFrom collection1

If you don’t have a shared filesystem, you can still do the copy
collection route:
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

java -jar solrcloud_manager-assembly-1.4.0.jar copycollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

This creates a new collection with the same settings, (clonecollection)
and triggers a one-shot “replication” into it. (copycollection) Again,
this is just framework for the proper (largely undocumented) Solr API
commands, to work around the lack of a convenient collections-level API
command.

One nice thing about using copy collection is that it can be used to keep
a backup collection up to date, only copying if necessary. Honestly
though, I don’t have as much experience with this use case as KNitin does
in solrcloud-haft, which is why I suggest using an empty collection in the
README right now. If you try that use case with solrcloud_manager, I’d be
interested in your experience. It should work, but you’ll need to disable
the verification with --skipCheck and check manually.


Having said all that though, yes, with your simple use case and small
collection, you can do everything you want with just cp. The easiest way
would be to make a backup copy of your index dir. If you need to restore,
shut down solr, nuke your index dir, and copy the backup in there. You’d
probably need to do this on all nodes at once though, to prevent a
non-leader from coming up and re-syncing with a piece of the index you
hadn’t restored yet.
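
For that cp-based approach the whole procedure is a couple of shell commands per node (the paths follow the /var/solr/data layout mentioned earlier in the thread; adjust the core names to yours):

    # backup, taken while the collection is idle
    cp -rp /var/solr/data/collectionmain_shard1_replica1/data/index \
           /backups/collectionmain_shard1_$(date +%Y%m%d)

    # restore: stop Solr on all nodes first, then on each node
    rm -rf /var/solr/data/collectionmain_shard1_replica1/data/index
    cp -rp /backups/collectionmain_shard1_20151121 \
           /var/solr/data/collectionmain_shard1_replica1/data/index
    # start Solr again only after every replica has been restored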




On 11/21/15, 10:12 PM, "Brian Narsi"  wrote:

>What are the caveats regarding the copy of a collection?
>
>At this time DIH takes only about 10 minutes. So in case of accidental
>delete we can just re-run the DIH. The reason I am thinking about backup
>is
>just in case records are deleted accidentally and the DIH cannot be run
>because the database is unavailable.
>
>Our collection is simple: 2 nodes - 1 collection - 2 shards with 2
>replicas
>each
>
>So a simple copy (cp command) for both the nodes/shards might work for us?
>How do I restore the data back?
>
>
>
>On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes 
>wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I
>>added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using
>>the
>> >> replication handler to backup data as you wish to any custom end
>>point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
>> wrote:
>> >>
>> >> > I am using Data Import Handler to retrieve data from a database
>>with
>> >> >
>> >> > full-import, clean = true, commit = true and optimize = true
>> >> >
>> >> > This has always worked correctly without any errors.
>> >> >
>> >> > But just to be on the safe side, I am 

Re: Data Import Handler / Backup indexes

2015-11-22 Thread Erick Erickson
These are just Lucene indexes. There's the Cloud backup and restore
that is being worked on.

But if the index is static (i.e. not being indexed to), simply copying
the data/index directory (well, actually the whole data directory and subdirs)
will back it up. Copying the index directory back
(I'd have Solr shut down when copying back) would restore the index.

Best,
Erick

On Sat, Nov 21, 2015 at 10:12 PM, Brian Narsi  wrote:
> What are the caveats regarding the copy of a collection?
>
> At this time DIH takes only about 10 minutes. So in case of accidental
> delete we can just re-run the DIH. The reason I am thinking about backup is
> just in case records are deleted accidentally and the DIH cannot be run
> because the database is unavailable.
>
> Our collection is simple: 2 nodes - 1 collection - 2 shards with 2 replicas
> each
>
> So a simple copy (cp command) for both the nodes/shards might work for us?
> How do I restore the data back?
>
>
>
> On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes  wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using the
>> >> replication handler to backup data as you wish to any custom end point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
>> wrote:
>> >>
>> >> > I am using Data Import Handler to retrieve data from a database with
>> >> >
>> >> > full-import, clean = true, commit = true and optimize = true
>> >> >
>> >> > This has always worked correctly without any errors.
>> >> >
>> >> > But just to be on the safe side, I am thinking that we should do a
>> >>backup
>> >> > before initiating Data Import Handler. And just in case something
>> >>happens
>> >> > restore the backup.
>> >> >
>> >> > Can backup be done automatically (before initiating Data Import
>> >>Handler)?
>> >> >
>> >> > Thanks
>> >> >
>> >>
>>
>>


Re: Data Import Handler / Backup indexes

2015-11-21 Thread Brian Narsi
What are the caveats regarding the copy of a collection?

At this time DIH takes only about 10 minutes. So in case of accidental
delete we can just re-run the DIH. The reason I am thinking about backup is
just in case records are deleted accidentally and the DIH cannot be run
because the database is unavailable.

Our collection is simple: 2 nodes - 1 collection - 2 shards with 2 replicas
each

So a simple copy (cp command) on both the nodes/shards might work for us?
How do I restore the data?



On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes  wrote:

>
> https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
> some backup/restore functionality similar to SOLR-5750 in the last
> release.
> Like SOLR-5750, this backup strategy requires a shared filesystem, but
> note that unlike SOLR-5750, I haven’t yet added any backup functionality
> for the contents of ZK. I’m currently working on some parts of that.
>
>
> Making a copy of a collection is supported too, with some caveats.
>
>
> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>
> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
> >
> >
> >
> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
> >
> >> afaik Data import handler does not offer backups. You can try using the
> >> replication handler to backup data as you wish to any custom end point.
> >>
> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
> >>This
> >> helps backup solr indices across clusters.
> >>
> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
> wrote:
> >>
> >> > I am using Data Import Handler to retrieve data from a database with
> >> >
> >> > full-import, clean = true, commit = true and optimize = true
> >> >
> >> > This has always worked correctly without any errors.
> >> >
> >> > But just to be on the safe side, I am thinking that we should do a
> >>backup
> >> > before initiating Data Import Handler. And just in case something
> >>happens
> >> > restore the backup.
> >> >
> >> > Can backup be done automatically (before initiating Data Import
> >>Handler)?
> >> >
> >> > Thanks
> >> >
> >>
>
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread Brian Narsi
Sorry I forgot to mention that we are using SolrCloud 5.1.0.



On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:

> afaik Data import handler does not offer backups. You can try using the
> replication handler to backup data as you wish to any custom end point.
>
> You can also try out : https://github.com/bloomreach/solrcloud-haft.  This
> helps backup solr indices across clusters.
>
> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>
> > I am using Data Import Handler to retrieve data from a database with
> >
> > full-import, clean = true, commit = true and optimize = true
> >
> > This has always worked correctly without any errors.
> >
> > But just to be on the safe side, I am thinking that we should do a backup
> > before initiating Data Import Handler. And just in case something happens
> > restore the backup.
> >
> > Can backup be done automatically (before initiating Data Import Handler)?
> >
> > Thanks
> >
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread KNitin
AFAIK the Data Import Handler does not offer backups. You can try using the
replication handler to back up data as you wish to any custom end point.

You can also try out https://github.com/bloomreach/solrcloud-haft. This
helps back up Solr indices across clusters.

On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:

> I am using Data Import Handler to retrieve data from a database with
>
> full-import, clean = true, commit = true and optimize = true
>
> This has always worked correctly without any errors.
>
> But just to be on the safe side, I am thinking that we should do a backup
> before initiating Data Import Handler. And just in case something happens
> restore the backup.
>
> Can backup be done automatically (before initiating Data Import Handler)?
>
> Thanks
>


Re: Data Import Handler / Backup indexes

2015-11-17 Thread Jeff Wartes

https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
some backup/restore functionality similar to SOLR-5750 in the last
release. 
Like SOLR-5750, this backup strategy requires a shared filesystem, but
note that unlike SOLR-5750, I haven’t yet added any backup functionality
for the contents of ZK. I’m currently working on some parts of that.


Making a copy of a collection is supported too, with some caveats.


On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:

>Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>
>
>
>On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>
>> afaik Data import handler does not offer backups. You can try using the
>> replication handler to backup data as you wish to any custom end point.
>>
>> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>>This
>> helps backup solr indices across clusters.
>>
>> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>>
>> > I am using Data Import Handler to retrieve data from a database with
>> >
>> > full-import, clean = true, commit = true and optimize = true
>> >
>> > This has always worked correctly without any errors.
>> >
>> > But just to be on the safe side, I am thinking that we should do a
>>backup
>> > before initiating Data Import Handler. And just in case something
>>happens
>> > restore the backup.
>> >
>> > Can backup be done automatically (before initiating Data Import
>>Handler)?
>> >
>> > Thanks
>> >
>>



Re: Data import handler not indexing all data

2015-11-07 Thread Alexandre Rafalovitch
Just to get the paranoid option out of the way, is 'id' actually the
column that has unique ids in your database? If you do "select
distinct id from imdb.director" - how many items do you get?

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 18:21, Yangrui Guo  wrote:
> Hello
>
> I'm being troubled by solr's data import handler. My solr version is 5.3.1
> and mysql is 5.5. I tried to index imdb data but found solr only partially
> indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the query
> result was 1636549. However DIH only fetched and indexed 287041 rows. I
> didn't see any error in the log. Why was this happening?
>
> Here's my data-config.xml
>
> 
>  url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> 
> 
> 
> 
> 
> 
> 
>
> Yangrui Guo


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Hi, thanks for the continued support. I'm really worried as my project
deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put SELECT
DISTINCT at the beginning of the query because IMDB doesn't have a table
for cast & crew. It puts movies, people, and their roles into one huge
table, 'cast_info'. Hence there are multiple rows per director, one row
per movie.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> Just to get the paranoid option out of the way, is 'id' actually the
> column that has unique ids in your database? If you do "select
> distinct id from imdb.director" - how many items do you get?
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 18:21, Yangrui Guo  > wrote:
> > Hello
> >
> > I'm being troubled by solr's data import handler. My solr version is
> 5.3.1
> > and mysql is 5.5. I tried to index imdb data but found solr only
> partially
> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> query
> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
> > didn't see any error in the log. Why was this happening?
> >
> > Here's my data-config.xml
> >
> > 
> >  > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >
> > Yangrui Guo
>


Re: Data import handler not indexing all data

2015-11-07 Thread Alexandre Rafalovitch
That's not quite the question I asked. Do a distinct on 'id' only in
the database itself. If your ids are NOT unique, you need to create a
composite or a virtual id for Solr, because whatever your
schema says is the uniqueKey will be used to deduplicate the
documents. If you have 10 documents with the same id value, only one
will be in the final Solr index.

I am not saying that's where the problem is, DIH is fiddly. But just
get that out of the way.

If that's not the case, you may need to isolate which documents are
failing. The easiest way to do so is probably to index a smaller
subset of records, say 1000. Pick a condition in your SQL to do so
(e.g. id value range). Then, see how many made it into Solr. If not
all 1000, export the list of IDs from SQL, then a list of IDs from
Solr (use CSV format and just fl=id). Sort both, compare, see what ids
are missing. Look what is strange about those documents as opposed to
the documents that did make it into Solr. Try to push one of those
missing documents explicitly into Solr by either modifying SQL query
in DIH or as CSV or whatever.

Good luck,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 19:07, Yangrui Guo  wrote:
> Hi thanks for the continued support. I'm really worried as my project
> deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> distinct in the beginning of the query because IMDB doesn't have a table
> for cast & crew. It puts movie and person and their roles into one huge
> table 'cast_info'. Hence there are multiple rows for a director, one row
> per his movie.
>
> On Saturday, November 7, 2015, Alexandre Rafalovitch 
> wrote:
>
>> Just to get the paranoid option out of the way, is 'id' actually the
>> column that has unique ids in your database? If you do "select
>> distinct id from imdb.director" - how many items do you get?
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 7 November 2015 at 18:21, Yangrui Guo > > wrote:
>> > Hello
>> >
>> > I'm being troubled by solr's data import handler. My solr version is
>> 5.3.1
>> > and mysql is 5.5. I tried to index imdb data but found solr only
>> partially
>> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
>> query
>> > result was 1636549. However DIH only fetched and indexed 287041 rows. I
>> > didn't see any error in the log. Why was this happening?
>> >
>> > Here's my data-config.xml
>> >
>> > 
>> > > > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> > 
>> >
>> > Yangrui Guo
>>


Re: Data import handler not indexing all data

2015-11-07 Thread Yangrui Guo
Yes the id is unique. If I only select distinct id,count(id) I get the same
results. However I found this is more likely a MySQL issue. I created a new
table called director1 and ran the query "insert into director1 select * from
director". I got only 287041 rows inserted, which was the same as Solr. I
don't know why the same query is causing two different results.

On Saturday, November 7, 2015, Alexandre Rafalovitch 
wrote:

> That's not quite the question I asked. Do a distinct on 'id' only in
> the database itself. If your ids are NOT unique, you need to create a
> composite or a virtual id for Solr. Because whatever your
> solrconfig.xml say is uniqueKey will be used to deduplicate the
> documents. If you have 10 documents with the same id value, only one
> will be in the final Solr.
>
> I am not saying that's where the problem is, DIH is fiddly. But just
> get that out of the way.
>
> If that's not the case, you may need to isolate which documents are
> failing. The easiest way to do so is probably to index a smaller
> subset of records, say 1000. Pick a condition in your SQL to do so
> (e.g. id value range). Then, see how many made it into Solr. If not
> all 1000, export the list of IDs from SQL, then a list of IDs from
> Solr (use CSV format and just fl=id). Sort both, compare, see what ids
> are missing. Look what is strange about those documents as opposed to
> the documents that did make it into Solr. Try to push one of those
> missing documents explicitly into Solr by either modifying SQL query
> in DIH or as CSV or whatever.
>
> Good luck,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:07, Yangrui Guo  > wrote:
> > Hi thanks for the continued support. I'm really worried as my project
> > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> > distinct in the beginning of the query because IMDB doesn't have a table
> > for cast & crew. It puts movie and person and their roles into one huge
> > table 'cast_info'. Hence there are multiple rows for a director, one row
> > per his movie.
> >
> > On Saturday, November 7, 2015, Alexandre Rafalovitch  >
> > wrote:
> >
> >> Just to get the paranoid option out of the way, is 'id' actually the
> >> column that has unique ids in your database? If you do "select
> >> distinct id from imdb.director" - how many items do you get?
> >>
> >> Regards,
> >>Alex.
> >> 
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 7 November 2015 at 18:21, Yangrui Guo  
> >> > wrote:
> >> > Hello
> >> >
> >> > I'm being troubled by solr's data import handler. My solr version is
> >> 5.3.1
> >> > and mysql is 5.5. I tried to index imdb data but found solr only
> >> partially
> >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> >> query
> >> > result was 1636549. However DIH only fetched and indexed 287041 rows.
> I
> >> > didn't see any error in the log. Why was this happening?
> >> >
> >> > Here's my data-config.xml
> >> >
> >> > 
> >> >  >> > url="jdbc:mysql://localhost:3306/imdb" user="root"
> password="password" />
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> > 
> >> >
> >> > Yangrui Guo
> >>
>


RE: Data Import Handler Stays Idle

2015-08-28 Thread Allison, Timothy B.
Only a month late to respond, and the response likely won't help.

I agree with Shawn that Tika can be a memory hog.  I try to leave 1GB per 
thread, but your mileage will vary dramatically depending on your docs.  I'd 
expect that you'd get an OOM, though, somewhere...

There have been rare bugs in various parsers, including the PDFParser, in 
various versions of Tika that cause permanent hangs.  I haven't experimented 
with DIH and known trigger files, but I suspect you'd get the behavior that 
you're seeing if this were to happen.

So, short of rolling your own ETL'r in lieu of DIH or hardening DIH to run tika 
in a different process (tika-server, perhaps -- 
https://issues.apache.org/jira/browse/SOLR-7632) or going big with Hadoop, 
morphlines, etc, your only hope is to upgrade Tika and hope that that was one 
of the bugs that we've already identified and fixed.

If you do go with morphlines...I don't think this has been fixed yet: 
https://github.com/kite-sdk/kite/issues/397

Did you ever figure out what was going wrong?

Best,

 Tim

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Tuesday, July 21, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Data Import Handler Stays Idle

On 7/21/2015 8:17 AM, Paden wrote:
 There are some zip files inside the directory and have been addressed 
 to in the database. I'm thinking those are the one's it's jumping 
 right over. They are not the issue. At least I'm 95% sure. And Shawn 
 if you're still watching I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr 5.x?  
Tika can require a lot of memory.  I would have expected there to be 
OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the -m option on the startup scripts to increase the max 
heap.  Starting with -m 2g would be a good idea.

Also, seeing the entire multi-line IOException from the log (which may be 
dozens of lines) could be important.

Thanks,
Shawn



RE: Data Import Handler Stays Idle

2015-08-28 Thread Allison, Timothy B.
 There are some zip files inside the directory and have been addressed 
 to in the database. I'm thinking those are the one's it's jumping 
 right over.

With SOLR-7189, which should have kicked in for 5.1, Tika shouldn't skip over 
Zip files, it should process all the contents of those zips and concatenate the 
extracted text into one string.


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Tuesday, July 21, 2015 10:41 AM
To: solr-user@lucene.apache.org
Subject: Re: Data Import Handler Stays Idle

On 7/21/2015 8:17 AM, Paden wrote:
 There are some zip files inside the directory and have been addressed 
 to in the database. I'm thinking those are the one's it's jumping 
 right over. They are not the issue. At least I'm 95% sure. And Shawn 
 if you're still watching I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr 5.x?  
Tika can require a lot of memory.  I would have expected there to be 
OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the -m option on the startup scripts to increase the max 
heap.  Starting with -m 2g would be a good idea.

Also, seeing the entire multi-line IOException from the log (which may be 
dozens of lines) could be important.

Thanks,
Shawn



Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
Okay. I'm going to run the index again with specifications that you
recommended. This could take a few hours but I will post the entire trace on
that error when it pops up again and I will let you guys know the results of
increasing the heap size. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218382.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
Hey Shawn, when I use the -m 2g option in my script I get the error 'cannot
open [path]/server/logs/solr.log for reading: No such file or directory'. I
do not see how this would affect that. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218389.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-21 Thread Paden
There are some zip files inside the directory and have been addressed to in
the database. I'm thinking those are the one's it's jumping right over. They
are not the issue. At least I'm 95% sure. And Shawn if you're still watching
I'm sorry I'm using solr-5.1.0.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218371.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-21 Thread Shawn Heisey
On 7/21/2015 8:17 AM, Paden wrote:
 There are some zip files inside the directory and have been addressed to in
 the database. I'm thinking those are the one's it's jumping right over. They
 are not the issue. At least I'm 95% sure. And Shawn if you're still watching
 I'm sorry I'm using solr-5.1.0.

Have you started Solr with a larger heap than the default 512MB in Solr
5.x?  Tika can require a lot of memory.  I would have expected there to
be OutOfMemoryError exceptions in the log if that were the problem, though.

You may need to use the -m option on the startup scripts to increase
the max heap.  Starting with -m 2g would be a good idea.
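For example (assuming the stock bin/solr script shipped with Solr 5.x and a
default install layout; this is only a sketch):

  bin/solr stop -all
  bin/solr start -m 2g

Then check the JVM args on the admin UI dashboard to confirm the 2GB heap
took effect.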

Also, seeing the entire multi-line IOException from the log (which may
be dozens of lines) could be important.

Thanks,
Shawn



Re: Data Import Handler Stays Idle

2015-07-20 Thread Raja Pothuganti
Is the number of IOExceptions equal to the number of un-imported/unprocessed
documents?

Is commit by any chance set to false in the import request? For
example:
http://localhost:8983/solr/db/dataimport?command=full-import&commit=false


Thanks
Raja

On 7/20/15, 4:51 PM, Paden rumsey...@gmail.com wrote:

I was consistently checking the logs to see if there were any errors that
would give me any idling. There were no errors except for a few skipped
documents due to some Illegal IOexceptions from Tika but none of those
occurred around the time that solr began idling. A lot of font warnings.
But
again. Nothing but font warnings around time of idling.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp421825
0p4218260.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Data Import Handler Stays Idle

2015-07-20 Thread Paden
Yes the number of unimported matches. No I did not specify false to commit
on any of my dataimporthandler. Since it defaults to true I really didn't
take it into account though. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218262.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-20 Thread Shawn Heisey
On 7/20/2015 3:03 PM, Paden wrote:
 I'm currently trying to index about 54,000 files with the Solr Data Import
 Handler and I've got a small problem. It fetches about half (28,289) of the
 54,000 files and it processes about 14,146 documents before it stops and just
 stands idle. Here's the status output

 {
   "responseHeader": {
     "status": 0,
     "QTime": 0
   },
   "initArgs": [
     "defaults",
     [
       "config",
       "db-data-config.xml",
       "update.chain",
       "skip-empty"
     ]
   ],
   "command": "status",
   "status": "idle",
   "importResponse": "",
   "statusMessages": {
     "Time Elapsed": "2:39:53.191",
     "Total Requests made to DataSource": "1",
     "Total Rows Fetched": "28289",
     "Total Documents Processed": "14146",
     "Total Documents Skipped": "0",
     "Full Dump Started": "2015-07-20 18:19:17"
   }
 }

 It has a green arrow next to the header where it says number of documents
 fetched/processed, but it doesn't say that it's done indexing. It also doesn't
 have the commit line that I've seen on my other core, on which I indexed about
 290 documents. This is the second time that I have tried to index these
 files. I swung by the office this last weekend to see how the index was
 going and (I didn't write the numbers down but I guess I should have) I seem
 to remember it being pretty much at this EXACT spot when the data import
 handler started being idle the last time too. Is there some line in the
 solr config that I have to change to actually commit some of the documents,
 so that it isn't all at once? Is there some doc limit I have reached
 that I don't know exists? Are the PDFs too large and killing Tika (and Solr
 with it)? I'm really kind of stuck here. 

What Solr version are you using, and if you look for the Solr logfile on
the disk, do you see any errors in it?  There may be a few more
questions to ask, but they will depend on the answers to those two.

You may be on to something with the idea of a PDF document that's
killing Tika.

Thanks,
Shawn



Re: Data Import Handler Stays Idle

2015-07-20 Thread Paden
I was consistently checking the logs to see if there were any errors that
would explain the idling. There were no errors except for a few skipped
documents due to some illegal IOExceptions from Tika, but none of those
occurred around the time that Solr began idling. A lot of font warnings. But
again, nothing but font warnings around the time of idling. 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218260.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Stays Idle

2015-07-20 Thread Raja Pothuganti
Yes the number of unimported matches (with IOExceptions)

What is the IOException about?

On 7/20/15, 5:10 PM, Paden rumsey...@gmail.com wrote:

Yes the number of unimported matches. No I did not specify false to
commit
on any of my dataimporthandler. Since it defaults to true I really didn't
take it into account though.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp421825
0p4218262.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Data Import Handler - reading GET

2015-03-16 Thread Alexandre Rafalovitch
Have you tried? As ${dih.request.foo}?
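For example (only a sketch; the core and handler names are assumptions), pass
the extra parameter on the request:

  curl "http://localhost:8983/solr/collection1/dataimport?command=full-import&foo=bar"

and reference it inside the DIH queries as ${dataimporter.request.foo} (or the
dih.request form above, if your version supports it).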

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 16 March 2015 at 14:51, Kiran J kiranjuni...@gmail.com wrote:
 Hi,

 In data import handler, I can read the clean query parameter using
 ${dih.request.clean} and pass it on to the queries. Is it possible to read
 any query parameter from the URL ? for eg ${foo} ?

 Thanks


RE: Data Import Handler Status

2014-12-04 Thread dhwani2388
Hi,

In Solr I am fetching the DIH status of the core using
/dataimport?command=status. Now the data import is running, though the status
URL is giving me an idle status. Sometimes it gives me the idle status at the
right time, once the data import is completed, but sometimes it gives the idle
status 1 or 2 seconds early.

Can any one help on this?






--
View this message in context: 
http://lucene.472066.n3.nabble.com/RE-Data-Import-Handler-Status-tp4172590.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Data Import Handler Status

2014-12-04 Thread Shawn Heisey
On 12/4/2014 9:18 AM, dhwani2388 wrote:
 In SOLR I am fetching DIH status of the core using
 /dataimport?command=status. Now the data import is running though the status
 URL giving me idle status. Some times its giving me idle status on right
 time once data import is completed but some times its giving idle status 1
 or 2 seconds early.

If the DIH status shows as idle but it's not really idle, then that's a
bug.  What evidence do you have that it's reporting incorrectly?

Thanks,
Shawn



RE: Data Import Handler for CSV file

2014-10-10 Thread Dyer, James
Nabil,

Unfortunately, the out-of-the-box functionality for DIH lacks a lot of what the 
csv handler has to offer.  There is a LineEntityProcessor (see 
http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor), but this 
will just output each line in a field called rawLine.  It is up to you to 
then write a Transformer that will split it on commas (or better, use a lib 
like commons-csv to process it).

There is an extension available as an old patch that will give 
LineEntityProcessor the ability to handle delimited and fixed-width files.  
However, you'll need to apply the patch yourself and build DIH from source.   
See https://issues.apache.org/jira/browse/SOLR-2549 .

James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: nabil Kouici [mailto:koui...@yahoo.fr] 
Sent: Thursday, October 09, 2014 4:26 PM
To: solr-user@lucene.apache.org; Ahmet Arslan
Subject: Re: Data Import Handler for CSV file

Hi Ahmet,
 
Thank you for this replay. Agree with you that csv update handler is fast but 
we need always to specify columns in the http request. In addition, I don't 
find documentation how to use csv update from solrj.

Could you please send me an example of DIH to load CSV file?

Regards,
Nabil.


On Thursday, 9 October 2014 at 21:05, Ahmet Arslan iori...@yahoo.com.INVALID wrote:
 


Hi Nabil,

whats wrong with csv update handler? It is quite fast.

By the way DIH has line entity processor, yes it is doable with existing DIH 
components.

Ahmet



On Thursday, October 9, 2014 9:58 PM, nabil Kouici koui...@yahoo.fr wrote:





Hi All,

Is it possible to have in solr a DIH to load from CSV file. Actually I'm using 
update/csv handler but not responding to my need.

Regards,
NKI.



Re: Data Import Handler for CSV file

2014-10-09 Thread Ahmet Arslan
Hi Nabil,

whats wrong with csv update handler? It is quite fast.

By the way DIH has line entity processor, yes it is doable with existing DIH 
components.

Ahmet
 

On Thursday, October 9, 2014 9:58 PM, nabil Kouici koui...@yahoo.fr wrote:





Hi All,

Is it possible to have in solr a DIH to load from CSV file. Actually I'm using 
update/csv handler but not responding to my need.

Regards,
NKI. 


Re: Data Import Handler for CSV file

2014-10-09 Thread nabil Kouici
Hi Ahmet,
 
Thank you for this replay. Agree with you that csv update handler is fast but 
we need always to specify columns in the http request. In addition, I don't 
find documentation how to use csv update from solrj.

Could you please send me an example of DIH to load CSV file?

Regards,
Nabil.


On Thursday, 9 October 2014 at 21:05, Ahmet Arslan iori...@yahoo.com.INVALID wrote:
 


Hi Nabil,

whats wrong with csv update handler? It is quite fast.

By the way DIH has line entity processor, yes it is doable with existing DIH 
components.

Ahmet



On Thursday, October 9, 2014 9:58 PM, nabil Kouici koui...@yahoo.fr wrote:





Hi All,

Is it possible to have in solr a DIH to load from CSV file. Actually I'm using 
update/csv handler but not responding to my need.

Regards,
NKI. 

Re: Data Import Handler for CSV file

2014-10-09 Thread Alexandre Rafalovitch
You could always define the parameters in solrconfig.xml on a custom
handler. Then you don't have to pass the same values over and over again.

Regards,
 Alex
On 09/10/2014 5:26 pm, nabil Kouici koui...@yahoo.fr wrote:

 Hi Ahmet,

 Thank you for this replay. Agree with you that csv update handler is fast
 but we need always to specify columns in the http request. In addition, I
 don't find documentation how to use csv update from solrj.

 Could you please send me an example of DIH to load CSV file?

 Regards,
 Nabil.


  On Thursday, 9 October 2014 at 21:05, Ahmet Arslan iori...@yahoo.com.INVALID
  wrote:



 Hi Nabil,

 whats wrong with csv update handler? It is quite fast.

 By the way DIH has line entity processor, yes it is doable with existing
 DIH components.

 Ahmet



 On Thursday, October 9, 2014 9:58 PM, nabil Kouici koui...@yahoo.fr
 wrote:





 Hi All,

 Is it possible to have in solr a DIH to load from CSV file. Actually I'm
 using update/csv handler but not responding to my need.

 Regards,
 NKI.


Re: Data Import Handler for CSV file

2014-10-09 Thread Ahmet Arslan
Hi,

I think you can define field names in the first line of csv. Why don't you use 
curl to index csv?
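For instance, a minimal curl sketch (the core name 'collection1' and the field
names below are only placeholders):

  # first line of data.csv is the header row with the field names
  curl "http://localhost:8983/solr/collection1/update/csv?commit=true" \
       --data-binary @data.csv -H 'Content-type: text/csv; charset=utf-8'

  # or, if the file has no header row, pass the columns once as a parameter
  curl "http://localhost:8983/solr/collection1/update/csv?commit=true&header=false&fieldnames=id,name,price" \
       --data-binary @data.csv -H 'Content-type: text/csv; charset=utf-8'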

I don't have a full working example with DIH, but I have the following example
that indexed every line as a separate Solr document.

You need to add a transformer that splits each line on commas.

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" name="fds"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" fileName=".*txt"
            baseDir="/Volumes/data/Documents" recursive="false" rootEntity="false"
            dataSource="null" transformer="TemplateTransformer">
      <entity onError="skip" name="jc" processor="LineEntityProcessor"
              url="${f.fileAbsolutePath}" dataSource="fds" rootEntity="true"
              transformer="TemplateTransformer">
        <field column="link" template="hello${f.fileAbsolutePath},${jc.rawLine}" />
        <field column="rawLine" name="rawLine" />
      </entity>
    </entity>
  </document>
</dataConfig>



On Friday, October 10, 2014 12:26 AM, nabil Kouici koui...@yahoo.fr wrote:
Hi Ahmet,

Thank you for this replay. Agree with you that csv update handler is fast but 
we need always to specify columns in the http request. In addition, I don't 
find documentation how to use csv update from solrj.

Could you please send me an example of DIH to load CSV file?

Regards,
Nabil.





On Thursday, 9 October 2014 at 21:05, Ahmet Arslan iori...@yahoo.com.INVALID wrote:



Hi Nabil,

whats wrong with csv update handler? It is quite fast.

By the way DIH has line entity processor, yes it is doable with existing DIH 
components.

Ahmet



On Thursday, October 9, 2014 9:58 PM, nabil Kouici koui...@yahoo.fr wrote:





Hi All,

Is it possible to have in solr a DIH to load from CSV file. Actually I'm using 
update/csv handler but not responding to my need.

Regards,
NKI.


Re: data import handler clarifications/ pros and cons.

2014-10-07 Thread Durga Palamakula
There is a built in scheduling @
http://wiki.apache.org/solr/DataImportHandler#Scheduling

But as others have mentioned cron is the simplest.

On Mon, Oct 6, 2014 at 8:56 PM, Karunakar Reddy karunaka...@gmail.com
wrote:

 Thanks Shawn and Gora for your  suggestions.
 @Gora sounds good. I am just getting clarity over it.


 Regards,
 Karunakar.

 On Tue, Oct 7, 2014 at 8:27 AM, Gora Mohanty g...@mimirtech.com wrote:

  On 6 October 2014 18:40, Karunakar Reddy karunaka...@gmail.com wrote:
  
   Hey Alex,
   Thanks for your reply.
   Is delta-import handler configurable? say if I want to update documents
   every 20 mins is it possible through any configuration/settings like
   autocommit?
 
  As a delta-import involves loading a URL, you can do this through a
  scheduler
  on your OS. On Linux, we have a cron job that uses curl. I do not see a
 big
  argument for Solr to include a scheduler.
 
  Regards,
  Gora
 




-- 
Follow us @NEOGOV http://twitter.com/NEOGOV and on Facebook
http://www.facebook.com/neogov

NEOGOV http://www.neogov.com/ is among the top fastest growing software
companies in the USA, recognized by Inc 500|5000, Delloitte Fast 500, and
the LA Business Journal. We are hiring! http://www.neogov.com/careers


Re: data import handler clarifications/ pros and cons.

2014-10-07 Thread Ahmet Arslan


Hi Durga,

That wiki talks about uncommitted code, so it is not built in.

Ahmet


On Tuesday, October 7, 2014 7:17 PM, Durga Palamakula dpalamak...@neogov.net 
wrote:
There is a built in scheduling @
http://wiki.apache.org/solr/DataImportHandler#Scheduling

But as others have mentioned cron is the simplest.




On Mon, Oct 6, 2014 at 8:56 PM, Karunakar Reddy karunaka...@gmail.com
wrote:

 Thanks Shawn and Gora for your  suggestions.
 @Gora sounds good. I am just getting clarity over it.


 Regards,
 Karunakar.

 On Tue, Oct 7, 2014 at 8:27 AM, Gora Mohanty g...@mimirtech.com wrote:

  On 6 October 2014 18:40, Karunakar Reddy karunaka...@gmail.com wrote:
  
   Hey Alex,
   Thanks for your reply.
   Is delta-import handler configurable? say if I want to update documents
   every 20 mins is it possible through any configuration/settings like
   autocommit?
 
  As a delta-import involves loading a URL, you can do this through a
  scheduler
  on your OS. On Linux, we have a cron job that uses curl. I do not see a
 big
  argument for Solr to include a scheduler.
 
  Regards,
  Gora
 




-- 
Follow us @NEOGOV http://twitter.com/NEOGOV and on Facebook
http://www.facebook.com/neogov

NEOGOV http://www.neogov.com/ is among the top fastest growing software
companies in the USA, recognized by Inc 500|5000, Delloitte Fast 500, and
the LA Business Journal. We are hiring! http://www.neogov.com/careers



Re: data import handler clarifications/ pros and cons.

2014-10-07 Thread Gora Mohanty
On 8 October 2014 01:00, Ahmet Arslan iori...@yahoo.com.invalid wrote:



 Hi Durga,

 That wiki talks about an uncommitted code. So it is not built in.

Maybe it is just me, but given that there are existing scheduling
solutions in most operating systems, I fail to understand why
people expect Solr to expand to include that. How would that
fit into Solr's goals?

IMHO, going by the argument that Solr should also do whatever
anyone could want, one could replace M-x hail-emacs with
M-x hail-solr-lucene.

Regards,
Gora


Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Shawn Heisey
On 10/6/2014 5:09 AM, Karunakar Reddy wrote:
 Please suggest me effective way of using data import handler.
 
 Here is my use case.
 
 I have different kind of items which needs to be indexed in solr . Eg(
 books, shoes,electronics etc... ) each one has in different relational
 table.
 I have only one core as of now which is been used for public search and for
 other search pages like (book search page/ electronics search page..)
 and updates are happening through indexing script which we are maintaining
 internally  .
 We are planning to use DIH(data import handler).
 
 1)Is it best way to use DIH/over indexing script? any pros and cons of
 using DIH?
 
 2) How can we index different type of documents(books,electronic..  the
 data is there in different tables in mysql ) through document import
 handler?
 
 3)What is the best way to do delta-import.? how do we fire delta-import
 request? is there any thing like auto delta import like autocommit?

If you already have an effective indexing method that does everything
you need, I would suggest sticking with it.

I think of DIH as a stopgap feature, a way to get started with Solr when
using a structured data store, until you can write your own indexing
procedure that is highly tailored to your situation.  I'm actually still
using DIH for full reindexes, controlled with SolrJ, but I have grand
designs for replacing it with a multi-threaded approach that hopefully
will be much faster.

DIH is a fairly efficient single-threaded way of accessing a single flat
table space from a database.  As soon as you try to make it include
multiple and/or nested entities, its performance will often drop
significantly.  If you can reduce all of your interaction with the
database to a single SELECT call -- using joins, a stored procedure, or
something similar, then you MIGHT be able to use DIH effectively.  The
DIH handler on each of my shards uses exactly one SELECT call.

There is currently no DIH scheduler built-in to Solr.  There are two
reasons that the idea has met with resistance:

1) There is already a built-in scheduling apparatus on *every* modern
operating system, one that has been tested, debugged, and is generally
bulletproof.  If a feature like that is built into Solr, users will be
unhappy if it doesn't work as advertised because we made a mistake in
the code.  I'd rather rely on an OS feature that's been around for
multiple decades.

2) As a group, the developers are resistant to features that would cause
Solr to make changes in the index without being *told* to do it by an
outside force.  There is already an issue in Jira for a DIH scheduler,
but the patch hasn't been committed.  Some developers would like to
include it.

Thanks,
Shawn



Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Alexandre Rafalovitch
1) DIH looks like a match to your needs, yes. You just trigger it from
your script and then it does the rest of the work asynchronously. But
you'll have to poll later for the status if you want to report on
success/failure (see the curl sketch after this list).

2) Yes, you can, just by defining several entities next to each other.
You can run them all or select them one by one. Just make sure to
define the delete queries correctly, so that when you run one entity it does
not delete the other entities' content (the default behaviour).

3) DIH supports delta-import. It's in the docs. Come back with more
detailed question if something is not clear there.
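A bare-bones curl sketch of the trigger-and-poll flow from 1) (the core name
'collection1' and the entity name 'books' are only examples):

  # kick off the import; the call returns immediately
  curl "http://localhost:8983/solr/collection1/dataimport?command=full-import&entity=books&wt=json"

  # poll until "status" goes back to "idle", then inspect the statusMessages
  curl "http://localhost:8983/solr/collection1/dataimport?command=status&wt=json"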

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 6 October 2014 07:09, Karunakar Reddy karunaka...@gmail.com wrote:
 Hi All,

 Please suggest me effective way of using data import handler.

 Here is my use case.

 I have different kind of items which needs to be indexed in solr . Eg(
 books, shoes,electronics etc... ) each one has in different relational
 table.
 I have only one core as of now which is been used for public search and for
 other search pages like (book search page/ electronics search page..)
 and updates are happening through indexing script which we are maintaining
 internally  .
 We are planning to use DIH(data import handler).

 1)Is it best way to use DIH/over indexing script? any pros and cons of
 using DIH?

 2) How can we index different type of documents(books,electronic..  the
 data is there in different tables in mysql ) through document import
 handler?

 3)What is the best way to do delta-import.? how do we fire delta-import
 request? is there any thing like auto delta import like autocommit?

  Please throw some light on this.
 
  Thanks & Regards,
 Karunakar


Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Alexandre Rafalovitch
On 6 October 2014 08:56, Shawn Heisey apa...@elyograg.org wrote:
 2) As a group, the developers are resistant to features that would cause
 Solr to make changes in the index without being *told* to do it by an
 outside force.  There is already an issue in Jira for a DIH scheduler,
 but the patch hasn't been committed.  Some developers would like to
 include it.


Just as a side-note (not DIH-related). The expiring documents
mechanism has a schedule, AFAIK

Regards,
  Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Karunakar Reddy
Hey Alex,
Thanks for your reply.
Is the delta-import handler configurable? Say, if I want to update documents
every 20 minutes, is it possible through any configuration/settings like
autocommit?

Regards,
Karunakar.

On Mon, Oct 6, 2014 at 6:24 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 1) DIH looks like a match to your needs, yes. You just trigger it from
 your script and then it does the rest of the work asynchronously. But
 you'll to pull later for the status if you want to report on
 success/failure.

 2) Yes, you can just by defining several entities next to each other.
 You can run them all or select them one by one. Just make sure to
 define the delete queries correctly, so when you run one query it does
 not delete other entity's content (default behaviour)

 3) DIH supports delta-import. It's in the docs. Come back with more
 detailed question if something is not clear there.

 Regards,
 Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 6 October 2014 07:09, Karunakar Reddy karunaka...@gmail.com wrote:
  Hi All,
 
  Please suggest me effective way of using data import handler.
 
  Here is my use case.
 
  I have different kind of items which needs to be indexed in solr . Eg(
  books, shoes,electronics etc... ) each one has in different relational
  table.
  I have only one core as of now which is been used for public search and
 for
  other search pages like (book search page/ electronics search page..)
  and updates are happening through indexing script which we are
 maintaining
  internally  .
  We are planning to use DIH(data import handler).
 
  1)Is it best way to use DIH/over indexing script? any pros and cons of
  using DIH?
 
  2) How can we index different type of documents(books,electronic..  the
  data is there in different tables in mysql ) through document import
  handler?
 
  3)What is the best way to do delta-import.? how do we fire delta-import
  request? is there any thing like auto delta import like autocommit?
 
  Please through be some light on this.
 
  Thanks  Regards,
  Karunakar



Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Gora Mohanty
On 6 October 2014 18:40, Karunakar Reddy karunaka...@gmail.com wrote:

 Hey Alex,
 Thanks for your reply.
 Is delta-import handler configurable? say if I want to update documents
 every 20 mins is it possible through any configuration/settings like
 autocommit?

As a delta-import involves loading a URL, you can do this through a scheduler
on your OS. On Linux, we have a cron job that uses curl. I do not see a big
argument for Solr to include a scheduler.
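For illustration only (the core name and handler path are assumptions), a
crontab entry that fires a delta-import every 20 minutes could look like:

  */20 * * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=delta-import" > /dev/null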

Regards,
Gora


Re: data import handler clarifications/ pros and cons.

2014-10-06 Thread Karunakar Reddy
Thanks Shawn and Gora for your  suggestions.
@Gora sounds good. I am just getting clarity over it.


Regards,
Karunakar.

On Tue, Oct 7, 2014 at 8:27 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 6 October 2014 18:40, Karunakar Reddy karunaka...@gmail.com wrote:
 
  Hey Alex,
  Thanks for your reply.
  Is delta-import handler configurable? say if I want to update documents
  every 20 mins is it possible through any configuration/settings like
  autocommit?

 As a delta-import involves loading a URL, you can do this through a
 scheduler
 on your OS. On Linux, we have a cron job that uses curl. I do not see a big
 argument for Solr to include a scheduler.

 Regards,
 Gora



Re: Data Import handler and join select

2014-08-08 Thread Alejandro Marqués Rodríguez
First of all thank you very much for the answer, James. It is very complete
and it gives us several alternatives :)

I think we will first try the cache approach: after solving this
problem (https://issues.apache.org/jira/browse/SOLR-5954) the performance has
improved, so along with the cache solution we may achieve the expected
performance.

We've also tried modifying the transformers and we've got it working the
way we were looking for, though the solutions you propose seem to be much
cleaner.

Regarding indexing through SolrJ, it was our first idea. The problem is that when
we started the project, DIH seemed to fit our needs perfectly, until we
tried with real data and realized the performance issues, so now
maybe it's a bit late for us to change everything :( If we have no
other option we will go that way, but we need to try less drastic solutions
first.

Thanks!


2014-08-07 18:11 GMT+02:00 Dyer, James james.d...@ingramcontent.com:

 Alejandro,

 You can use a sub-entity with a cache using DIH.  This will solve the
 n+1-select problem and make it run quickly.  Unfortunately, the only
 built-in cache implementation is in-memory so it doesn't scale.  There is a
 fast, disk-backed cache using bdb-je, which I use in production.  See
 https://issues.apache.org/jira/browse/SOLR-2613 .  You will need to build
 this youself and include it on the classpath, and obtain a copy of bdb-je
 from Oracle.  While bdb-je is open source, its license is incompatible with
 ASL so this will never officially be part of Solr.

 Once you have a disk-backed cache, you can specify it on the child entity
 like this:
  <entity name="parent" query="select id, ... from parent table">
  <entity
  name="child"
  query="select foreignKey, ... from child_table"
  cacheKey="foreignKey"
  cacheLookup="parent.id"
  processor="SqlEntityProcessor"
  transformer="..."
  cacheImpl="BerkleyBackedCache"
  />
  </entity>

 If you don't want to go down this path, you can achieve this all with one
  query, if you include an ORDER BY to sort by whatever field is used as
 Solr's uniqueKey, and add a dummy row at the end with a UNION:

 SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p
 INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey
 UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual
 ORDER BY uniqueKey, lastInd

 Then your transformer would need to keep the lastUniqueKey in an
 instance variable and keep a running map of everything its seen for that
 key.  When the key changes, or if on the last row, send that map as the
 document.  Otherwise, the transformer returns null.  This will collect data
 from each row seen onto one document.

 Keep in mind also, that in a lot of cases like this, it might just be
 easiest to write a program that uses solrj to send your documents rather
 than trying to make DIH's features fit your use-case.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Alejandro Marqués Rodríguez [mailto:
 amarq...@paradigmatecnologico.com]
 Sent: Thursday, August 07, 2014 1:43 AM
 To: solr-user@lucene.apache.org
 Subject: Data Import handler and join select

 Hi,

 I have one problem while indexing with data import hadler while doing a
 join select. I have two tables, one with products and another one with
 descriptions for each product in several languages.

 So it would be:

 Products: ID, NAME, BRAND, PRICE, ...
 Descriptions: ID, LANGUAGE, DESCRIPTION

 I would like to have every product indexed as a document with a multivalued
 field language which contains every language that has an associated
 description and several dinamic fields description_ one for each
 language.

 So it would be for example:

 Id: 1
 Name: Product
 Brand: Brand
 Price: 10
 Languages: [es,en]
 Description_es: Descripción en español
 Description_en: English description

 Our first approach was using sub-entities for the data import handler and
 after implementing some transformers we had everything indexed as we
 wanted. The sub-entity process added the descriptions for each language to
 the solr document and then indexed them.

 The problem was performance. I've read that using sub-entities affected
 performance greatly, so we changed our process in order to use a join
 instead.

 Performance was greatly improved this way but now we have a problem. Each
 time a row is processed a solr document is generated and indexed into solr,
 but the data is not added to any previous data, but it replaces it.

 If we had the previous example the query resulting from the join would be:

 Id - Name - Brand - Price - Language - Description
 1 - Product - Brand - 10 - es - Descripción en español
 1 - Product - Brand - 10 - en - English description

 So when indexing as both have the same id the only information I get is the
 second row.

  Is there any way for data import handler to manage this and allow the
  documents to be indexed, updating any previous data?

RE: Data Import handler and join select

2014-08-07 Thread Dyer, James
Alejandro,

You can use a sub-entity with a cache using DIH.  This will solve the 
n+1-select problem and make it run quickly.  Unfortunately, the only built-in 
cache implementation is in-memory so it doesn't scale.  There is a fast, 
disk-backed cache using bdb-je, which I use in production.  See 
https://issues.apache.org/jira/browse/SOLR-2613 .  You will need to build this 
yourself and include it on the classpath, and obtain a copy of bdb-je from 
Oracle.  While bdb-je is open source, its license is incompatible with ASL so 
this will never officially be part of Solr.

Once you have a disk-backed cache, you can specify it on the child entity like 
this:
<entity name="parent" query="select id, ... from parent table">
  <entity
    name="child"
    query="select foreignKey, ... from child_table"
    cacheKey="foreignKey"
    cacheLookup="parent.id"
    processor="SqlEntityProcessor"
    transformer="..."
    cacheImpl="BerkleyBackedCache"
  />
</entity>

If you don't want to go down this path, you can achieve this all with one 
query, if you include an ORDER BY to sort by whatever field is used as Solr's 
uniqueKey, and add a dummy row at the end with a UNION:

SELECT p.uniqueKey, ..., 'A' as lastInd from PRODUCTS p 
INNER JOIN DESCRIPTIONS d ON p.uniqueKey = d.productKey
UNION SELECT 0 as uniqueKey, ... , 'B' as lastInd from dual 
ORDER BY uniqueKey, lastInd

Then your transformer would need to keep the lastUniqueKey in an instance 
variable and keep a running map of everything its seen for that key.  When the 
key changes, or if on the last row, send that map as the document.  Otherwise, 
the transformer returns null.  This will collect data from each row seen onto 
one document.

Keep in mind also, that in a lot of cases like this, it might just be easiest 
to write a program that uses solrj to send your documents rather than trying to 
make DIH's features fit your use-case.  

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, August 07, 2014 1:43 AM
To: solr-user@lucene.apache.org
Subject: Data Import handler and join select

Hi,

I have one problem while indexing with the data import handler while doing a
join select. I have two tables, one with products and another one with
descriptions for each product in several languages.

So it would be:

Products: ID, NAME, BRAND, PRICE, ...
Descriptions: ID, LANGUAGE, DESCRIPTION

I would like to have every product indexed as a document with a multivalued
field language which contains every language that has an associated
description and several dynamic fields description_ one for each language.

So it would be for example:

Id: 1
Name: Product
Brand: Brand
Price: 10
Languages: [es,en]
Description_es: Descripción en español
Description_en: English description

Our first approach was using sub-entities for the data import handler and
after implementing some transformers we had everything indexed as we
wanted. The sub-entity process added the descriptions for each language to
the solr document and then indexed them.

The problem was performance. I've read that using sub-entities affected
performance greatly, so we changed our process in order to use a join
instead.

Performance was greatly improved this way but now we have a problem. Each
time a row is processed a solr document is generated and indexed into solr,
but the data is not added to any previous data, but it replaces it.

If we had the previous example the query resulting from the join would be:

Id - Name - Brand - Price - Language - Description
1 - Product - Brand - 10 - es - Descripción en español
1 - Product - Brand - 10 - en - English description

So when indexing as both have the same id the only information I get is the
second row.

Is there any way for data import handler to manage this and allow the
documents to be indexed updating any previous data?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Re: Data Import Handler - resource not found - Jetty - Windows 7

2014-07-25 Thread Shawn Heisey
On 7/25/2014 1:06 AM, Yavar Husain wrote:
 Have most of experience working on Solr with Tomcat. However I recently
 started with Jetty. I am using Solr 4.7.0 on Windows 7. I have configured
 solr properly and am able to see the admin UI as well as velocity browse.
 Dataimporthandler screen is also getting displayed. However when I do a
 full import it fails with the following error:
 
 INFO  - 2014-07-25 12:28:35.177; org.apache.solr.core.SolrCore;
 [collection1] webapp=/solr path=/dataimport
 params={indent=truecommand=status_=1406271515176wt=json} status=0
 QTime=0
 ERROR - 2014-07-25 12:28:35.179; org.apache.solr.common.SolrException;
 java.io.IOException: Can't find resource
 'C:/solr-4.7.0/example/solr/collection1/conf' in classpath or
 'C:\solr-4.7.0\example\solr\collection1\conf'
 at
 org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:342)
 at
 org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:134)

In 4.7.0, line 134 of DataImportHandler.java is concerned with locating
the config file for the dataimport handler.  In the following excerpt
from a solrconfig.xml file included with Solr, the config file is
db-data-config.xml.  What do you have for this in your solrconfig.xml?

   <requestHandler name="/dataimport"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">db-data-config.xml</str>
     </lst>
   </requestHandler>

Thanks,
Shawn



RE: Data Import Handler

2013-11-13 Thread Ramesh
James, can you elaborate on how to process driver="${dataimporter.request.driver}"
and url="${dataimporter.request.url}", and where to put these?
My purpose is to configure my DB details (url, username, password) in a properties file.

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Wednesday, November 06, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

If you prepend the variable name with dataimporter.request, you can
include variables like these as request parameters:

<dataSource name="ds" driver="${dataimporter.request.driver}"
    url="${dataimporter.request.url}" />

/dih?driver=some.driver.class&url=jdbc:url:something

If you want to include these in solrcore.properties, you can additionally
add each property to solrconfig.xml like this:

<requestHandler name="/dih"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="driver">${dih.driver}</str>
    <str name="url">${dih.url}</str>
  </lst>
</requestHandler>

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest me how can customize dataconfig.xml file 

I want to provide database details like( db_url,uname,password ) from my own
properties file instead of dataconfig.xaml file





RE: Data Import Handler

2013-11-13 Thread Dyer, James
In solrcore.properties, put:

datasource.url=jdbc:xxx:yyy
datasource.driver=com.some.driver

In solrconfig.xml, put:

<requestHandler name="/dih"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    ...
    <str name="dsDriver">${datasource.driver}</str>
    <str name="dsUrl">${datasource.url}</str>
    ...
  </lst>
</requestHandler>

In data-config.xml, put:
<dataSource name="ds" driver="${dataimporter.request.dsDriver}"
    url="${dataimporter.request.dsUrl}" />

Hope this works for you.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com] 
Sent: Wednesday, November 13, 2013 9:00 AM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

James can elaborate how to process driver=${dataimporter.request.driver} 
url =${dataimporter.request.url} and all where to mention these 
my purpose is to config my DB Details(url,uname,password) in properties file

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Wednesday, November 06, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

If you prepend the variable name with dataimporter.request, you can
include variables like these as request parameters:

dataSource name=ds driver=${dataimporter.request.driver}
url=${dataimporter.request.url} /

/dih?driver=some.driver.classurl=jdbc:url:something

If you want to include these in solrcore.properties, you can additionally
add each property to solrconfig.xml like this:

requestHandler name=/dih
class=org.apache.solr.handler.dataimport.DataImportHandler
lst name=defaults
str name=driver${dih.driver}/str
str name=url${dih.url}/str
/lst
/requestHandler

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest me how can customize dataconfig.xml file 

I want to provide database details like( db_url,uname,password ) from my own
properties file instead of dataconfig.xaml file







RE: Data Import Handler

2013-11-13 Thread Ramesh
It needs to be kept outside of Solr, in a customized Mysolr_core.properties.
How can I access it?

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com] 
Sent: Wednesday, November 13, 2013 8:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

In solrcore.properties, put:

datasource.url=jdbc:xxx:yyy
datasource.driver=com.some.driver

In solrconfig.xml, put:

requestHandler name=/dih
class=org.apache.solr.handler.dataimport.DataImportHandler
lst name=defaults
... 
str name=dsDriver${datasource.driver}/str
str name=dsUrl${datasource.url}/str
...
/lst
/requestHandler

In data-config.xml, put:
dataSource name=ds driver=${dataimporter.request.dsDriver}
url=${dataimporter.request.dsUrl} /

Hope this works for you.

James Dyer
Ingram Content Group
(615) 213-4311


-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 13, 2013 9:00 AM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

James can elaborate how to process driver=${dataimporter.request.driver} 
url =${dataimporter.request.url} and all where to mention these my purpose
is to config my DB Details(url,uname,password) in properties file

-Original Message-
From: Dyer, James [mailto:james.d...@ingramcontent.com]
Sent: Wednesday, November 06, 2013 7:42 PM
To: solr-user@lucene.apache.org
Subject: RE: Data Import Handler

If you prepend the variable name with dataimporter.request, you can
include variables like these as request parameters:

dataSource name=ds driver=${dataimporter.request.driver}
url=${dataimporter.request.url} /

/dih?driver=some.driver.classurl=jdbc:url:something

If you want to include these in solrcore.properties, you can additionally
add each property to solrconfig.xml like this:

requestHandler name=/dih
class=org.apache.solr.handler.dataimport.DataImportHandler
lst name=defaults
str name=driver${dih.driver}/str
str name=url${dih.url}/str
/lst
/requestHandler

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com]
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest me how can customize dataconfig.xml file 

I want to provide database details like( db_url,uname,password ) from my own
properties file instead of dataconfig.xaml file









Re: Data Import Handler

2013-11-06 Thread Peter Keegan
I've done this by adding an attribute to the entity element (e.g.
myconfig="myconfig.xml"), and reading it in the 'init' method with
context.getResolvedEntityAttribute("myconfig").

Peter


On Wed, Nov 6, 2013 at 8:25 AM, Ramesh ramesh.po...@vensaiinc.com wrote:

 Hi Folks,



 Can anyone suggest me how can customize dataconfig.xml file

 I want to provide database details like( db_url,uname,password ) from my
 own
 properties file instead of dataconfig.xaml file




Re: Data Import Handler

2013-11-06 Thread Giovanni
I configured a data source in tomcat and referenced it by its jdbc name.

So dev and production sites share the same config file but use different DBs.

I hope this helps



 On 6 Nov 2013, at 13:25, Ramesh ramesh.po...@vensaiinc.com 
 wrote:
 
 Hi Folks,
 
 
 
 Can anyone suggest me how can customize dataconfig.xml file 
 
 I want to provide database details like( db_url,uname,password ) from my own
 properties file instead of dataconfig.xaml file
 


RE: Data Import Handler

2013-11-06 Thread Dyer, James
If you prepend the variable name with dataimporter.request, you can include 
variables like these as request parameters:

<dataSource name="ds" driver="${dataimporter.request.driver}"
    url="${dataimporter.request.url}" />

/dih?driver=some.driver.class&url=jdbc:url:something
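Spelled out as a curl call, for example (host, port and core name are
placeholders; URL-encode the jdbc url if it contains special characters):

  curl "http://localhost:8983/solr/collection1/dih?command=full-import&driver=some.driver.class&url=jdbc:url:something"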

If you want to include these in solrcore.properties, you can additionally add 
each property to solrconfig.xml like this:

<requestHandler name="/dih"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="driver">${dih.driver}</str>
    <str name="url">${dih.url}</str>
  </lst>
</requestHandler>

Then in solrcore.properties:
 dih.driver=some.driver.class
 dih.url=jdbc:url:something

See http://wiki.apache.org/solr/SolrConfigXml?#System_property_substitution


James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: Ramesh [mailto:ramesh.po...@vensaiinc.com] 
Sent: Wednesday, November 06, 2013 7:25 AM
To: solr-user@lucene.apache.org
Subject: Data Import Handler

Hi Folks,

 

Can anyone suggest me how can customize dataconfig.xml file 

I want to provide database details like( db_url,uname,password ) from my own
properties file instead of dataconfig.xaml file



Re: Data import handler with multi tables

2013-10-30 Thread Stefan Matheis
That is what I'd call a compound key? :) Using multiple attributes to generate a
unique key across multiple tables.


On Wednesday, October 30, 2013 at 2:10 AM, dtphat wrote:

  Yes, I've just used concat(id, '_', tableName) instead of using a compound key. I
  think this is an easy way.
 Thanks.
 
 
 
 -
 Phat T. Dong
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Re-Data-import-handler-with-multi-tables-tp4098048p4098328.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 




Re: Data import handler with multi tables

2013-10-29 Thread Stefan Matheis
I've never looked for another way, what's the problem using a compound key?


On Monday, October 28, 2013 at 1:38 PM, dtphat wrote:

 Hi,
  Is there no other way to import all the data in this case besides
  using a compound key?
 Thanks.
 
 
 
 -
 Phat T. Dong
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Re-Data-import-handler-with-multi-tables-tp4098048p4098056.html
 Sent from the Solr - User mailing list archive at Nabble.com 
 (http://Nabble.com).
 
 



