child doc filter

2016-11-03 Thread Tim Williams
I'm using the BlockJoinQuery to query child docs and return the
parent.  I'd like the equivalent of a filter that applies to the
child docs, and I don't see a way to do that with the BlockJoin stuff.
It looks like I could modify it to accept some childFilter param and
add a QueryWrapperFilter right after the child query is created[1],
but before I do that, I wanted to see if there's a built-in way to
achieve the same behavior?

Thanks,
--tim

[1] - 
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L69
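
A minimal sketch of the kind of child-doc filtering being asked about,
done at the Lucene level rather than by patching BlockJoinParentQParser:
the extra "child filter" is AND-ed into the child query as a non-scoring
FILTER clause before the block join. This is an illustrative workaround,
not the built-in mechanism Tim is asking for; the field names and queries
are hypothetical.

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.BitSetProducer;
    import org.apache.lucene.search.join.QueryBitSetProducer;
    import org.apache.lucene.search.join.ScoreMode;
    import org.apache.lucene.search.join.ToParentBlockJoinQuery;

    public class ChildFilterSketch {
        public static Query childFilteredJoin() {
            // Marks which docs in each block are parents (hypothetical field).
            BitSetProducer parents =
                new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));

            // The original child query plus the extra child filter, combined
            // so the filter constrains matches but contributes no score.
            Query childQuery = new TermQuery(new Term("body", "lucene"));
            Query childFilter = new TermQuery(new Term("lang", "en"));
            Query filtered = new BooleanQuery.Builder()
                .add(childQuery, Occur.MUST)
                .add(childFilter, Occur.FILTER)
                .build();

            // Join the filtered children up to their parents.
            return new ToParentBlockJoinQuery(filtered, parents, ScoreMode.Max);
        }
    }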


Re: Running Lucene/Solr on Hadoop

2016-01-04 Thread Tim Williams
Apache Blur (Incubating) has several approaches (hive, spark, m/r)
that could probably help with this, ranging from very experimental to
stable.  If you're interested, you can ask over on
blur-u...@incubator.apache.org ...

Thanks,
--tim

On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins  wrote:
> Hi Erick,
>
> Thank you for your response and pointer. What I mean by running Lucene/SOLR
> on Hadoop is having a Lucene/SOLR index available to be queried from
> mapreduce, or whatever best practice is recommended.
>
> I need this mechanism to do large-scale row deduplication. Let me
> elaborate on why:
>
>1. I have two data sources with 35 and 40 million records of customer
>profiles - the data come from two systems (SAP and MS CRM).
>2. I need to index and compare, row by row, the two data sources using
>the name, address, birth date, phone and email fields. Birth date and
>email use exact comparison; the other fields use probabilistic
>comparison. Btw, the data has been normalized before being indexed.
>3. Each finding will be categorized under the same person and
>deduplicated automatically or with user intervention, depending on the
>score.
>
> I usually do this with a Lucene index on the local filesystem, using term
> vectors, but since this will be a repeated task, and management has
> challenged me to do it on top of a Hadoop cluster, I need a framework
> or best practice for it.
>
> I understand that putting a Lucene index on HDFS is not very appropriate,
> since HDFS is designed for large block operations. With that understanding,
> I use SOLR and hope to query it with HTTP calls from a mapreduce job. The
> code snippet is below.
>
> URL url = new URL(SOLR-Query-URL);  // placeholder for the actual query URL
> HttpURLConnection connection = (HttpURLConnection) url.openConnection();
> connection.setRequestMethod("GET");
>
> The latter method turns out to perform very badly. A simple mapreduce job
> that only reads the data sources and writes to HDFS takes 15 minutes, but
> with the HTTP requests added it has been running for three hours and is
> still going.
>
> What went wrong? And what would be a solution to my problem?
>
> Thanks,
>
> Dino
>
> On Mon, Dec 14, 2015 at 12:30 AM, Erick Erickson  wrote:
>
>> First, what do you mean "run Lucene/Solr on Hadoop"?
>>
>> You can use the HdfsDirectoryFactory to store Solr/Lucene
>> indexes on Hadoop; at that point the actual filesystem
>> that holds the index is transparent to the end user, and you just
>> use Solr as you would if it were using indexes on the local
>> file system. See:
>> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>>
>> If you want to use Map-Reduce to _build_ indexes, see the
>> MapReduceIndexerTool in the Solr contrib area.
>>
>> Best,
>> Erick
>>
>
> --
> Regards,
>
> Dino
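
One likely culprit in the slow variant above is opening a fresh HTTP
connection per record from the mappers. A hedged sketch of a common fix,
assuming SolrJ (a 6.x-or-later version where HttpSolrClient.Builder is
available): reuse one client per mapper and let it pool connections. The
URL, core name, fields, and class name below are all illustrative.

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocumentList;

    public class DedupLookup implements AutoCloseable {
        // One client per mapper JVM; it pools HTTP connections internally,
        // unlike a fresh HttpURLConnection per record.
        private final HttpSolrClient client =
            new HttpSolrClient.Builder("http://solr-host:8983/solr/customers").build();

        public SolrDocumentList findCandidates(String name, String phone)
                throws SolrServerException, IOException {
            SolrQuery q = new SolrQuery();
            // Fuzzy match on name, exact match on phone -- mirroring the
            // mixed exact/probabilistic comparison described above. Values
            // are escaped so data cannot break the query syntax.
            q.setQuery("name:" + ClientUtils.escapeQueryChars(name) + "~2"
                + " AND phone:\"" + ClientUtils.escapeQueryChars(phone) + "\"");
            q.setRows(20);
            return client.query(q).getResults();
        }

        @Override
        public void close() throws IOException {
            client.close();
        }
    }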


Re: REST calls

2010-06-30 Thread Tim Williams
On Wed, Jun 30, 2010 at 12:39 AM, Don Werve d...@madwombat.com wrote:
> 2010/6/27 Jason Chaffee jchaf...@ebates.com
>
>> The solr docs say it is RESTful, yet it seems that it doesn't use http
>> headers in a RESTful way.  For example, it doesn't seem to use the Accept:
>> request header to determine the media-type to be returned.  Instead, it
>> requires a query parameter to be used in the URL.  Also, it doesn't seem
>> to return 304 Not Modified if the request header if-modified-since is
>> used.
>
> The summary:
>
> Solr is restful, and does a very good job of it.

I'm not so sure...

> The long version:
>
> There is no official 'REST' standard that dictates the behavior of the
> implementation; rather, REST is a set of guidelines on building APIs that
> are both discoverable and easily usable without having to resort to
> third-party libraries.
>
> Generally speaking, an application is RESTful if it provides an API that
> accepts arguments passed as HTTP form variables, returns results in an open
> format (XML, JSON, YAML, etc.), and respects certain semantics relating to
> HTTP verbs; e.g., GET/HEAD return the resource without modification, DELETEs
> are destructive, POST creates a resource, PUT alters it.
>
> Solr meets all of these requirements.

With fairly limited knowledge of Solr (I'm a Lucene user), I'd like to
offer an alternate view.

- Solr seems to violate the hypermedia-driven constraint (e.g. it
seems not to be hypertext-driven at all). [1]

- Solr seems to violate the uniform interface constraint and the
identification-of-resources constraint (e.g. by having commands in
the entity body instead of exposing resources whose state is
manipulated through the standard methods, and, I gather, by
overloading methods instead of using the standard ones, such as
DELETE for deletes).

I'd conclude Solr is not RESTful.

The representation argument is a bit of a red herring, btw. Not using
Accept for content negotiation isn't the problem; using agent-driven
negotiation without being hypertext-driven is [from a REST pov].

--tim

[1] - http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven
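
To make the negotiation point concrete, here is a small sketch of the two
styles being debated, in plain Java. Solr of this era selects the response
format via the wt query parameter; the Accept-header variant shows what
header-driven negotiation would look like, and is not something Solr
honors. The URLs are illustrative.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConnegSketch {
        public static void main(String[] args) throws Exception {
            // Agent-driven, in-URL negotiation: the format is named in the
            // query string, as Solr expects.
            URL wtUrl = new URL("http://localhost:8983/solr/select?q=*:*&wt=json");
            HttpURLConnection wtConn = (HttpURLConnection) wtUrl.openConnection();
            wtConn.setRequestMethod("GET");
            System.out.println("wt=json -> " + wtConn.getResponseCode());

            // Header-driven negotiation: the client states its preferred
            // media type via Accept; a server doing server-driven conneg
            // would choose the representation from this header.
            URL url = new URL("http://localhost:8983/solr/select?q=*:*");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");
            System.out.println("Accept header -> " + conn.getResponseCode());
        }
    }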


Re: REST calls

2010-06-30 Thread Tim Williams
On Wed, Jun 30, 2010 at 9:17 AM, Jak Akdemir jakde...@gmail.com wrote:
> On Wed, Jun 30, 2010 at 7:39 AM, Don Werve d...@madwombat.com wrote:
>
>> 2010/6/27 Jason Chaffee jchaf...@ebates.com
>>
>>> The solr docs say it is RESTful, yet it seems that it doesn't use http
>>> headers in a RESTful way.  For example, it doesn't seem to use the Accept:
>>> request header to determine the media-type to be returned.  Instead, it
>>> requires a query parameter to be used in the URL.  Also, it doesn't seem
>>> to return 304 Not Modified if the request header if-modified-since is
>>> used.
>>
>> The summary:
>>
>> Solr is restful, and does a very good job of it.
>>
>> The long version:
>>
>> There is no official 'REST' standard that dictates the behavior of the
>> implementation; rather, REST is a set of guidelines on building APIs that
>> are both discoverable and easily usable without having to resort to
>> third-party libraries.
>>
>> Generally speaking, an application is RESTful if it provides an API that
>> accepts arguments passed as HTTP form variables, returns results in an open
>> format (XML, JSON, YAML, etc.), and respects certain semantics relating to
>> HTTP verbs; e.g., GET/HEAD return the resource without modification, DELETEs
>> are destructive, POST creates a resource, PUT alters it.


> Actually, it is not a constraint to use all four of *GET*, *PUT*, *POST*,
> and *DELETE*.
> Using GET and POST can be enough to be RESTful, as Roy Fielding
> has pointed out:
> http://roy.gbiv.com/untangled/2009/it-is-okay-to-use-post

In Roy's post, I'd point out: "POST only becomes an issue when it is
used in a situation for which some other method is ideally suited"
(e.g. DELETE to delete).

Also, GET and POST *could* be enough if and only if you took care to
design your resources properly[1].

--tim

[1] - http://www.amundsen.com/blog/archives/1063